Cache format
The cache format is a binary format used by VW that is designed to be fast to read. It can change between releases and is not portable. It is useful for speeding things up when a dataset is processed more than once.
For example, the following table shows the time it takes to parse the first 10k lines of the RCV1 dataset.
Input | Time |
---|---|
Cache | 11.8 ms +- 0.3 ms |
Text | 31.5 ms +- 1.0 ms |
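To illustrate why a binary cache can be cheaper to read than text, here is a minimal sketch using only the Python standard library. It is not VW's actual cache layout: the record layout, the use of `zlib.crc32` as a stand-in for feature hashing, and the function names `parse_text`, `encode` and `decode` are all assumptions made for illustration. The point is that tokenizing, number parsing and name hashing happen once, when the record is written, and reading back only unpacks fixed-size fields.

```python
import struct
import zlib

HEADER = "<dI"   # label (float64), feature count (uint32)
FEATURE = "<Id"  # hashed feature index (uint32), value (float64)


def parse_text(line):
    """Parse 'label | name:value ...' into (label, [(index, value), ...])."""
    label_part, feature_part = line.split("|", 1)
    features = []
    for token in feature_part.split():
        name, _, value = token.partition(":")
        # Hypothetical stand-in for feature hashing; a bare name gets value 1.0.
        index = zlib.crc32(name.encode("utf-8"))
        features.append((index, float(value) if value else 1.0))
    return float(label_part), features


def encode(example):
    """Serialize one parsed example into a compact binary record."""
    label, features = example
    out = struct.pack(HEADER, label, len(features))
    for index, value in features:
        out += struct.pack(FEATURE, index, value)
    return out


def decode(buffer, offset=0):
    """Read one record at offset; return (example, offset of next record)."""
    label, count = struct.unpack_from(HEADER, buffer, offset)
    offset += struct.calcsize(HEADER)
    features = []
    for _ in range(count):
        features.append(struct.unpack_from(FEATURE, buffer, offset))
        offset += struct.calcsize(FEATURE)
    return (label, features), offset
```

Decoding a record returns the same label and feature pairs that `parse_text` produced, without splitting strings or converting text to numbers again.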
There is support for both reading and writing the cache format. vowpal_wabbit_next.CacheFormatWriter can be used to create a cache file, and vowpal_wabbit_next.CacheFormatReader to read a cache file.
For example, to create a cache file of a sample dataset in VW text format:
import vowpal_wabbit_next as vw

dataset = [
    "0 | price:.23 sqft:.25 age:.05 2006",
    "1 | price:.18 sqft:.15 age:.35 1976",
    "0 | price:.53 sqft:.32 age:.87 1924",
]

workspace = vw.Workspace()
text_parser = vw.TextFormatParser(workspace)

# Cache files are binary, so open the file in "wb" mode.
with open("data.cache", "wb") as cache_file:
    with vw.CacheFormatWriter(workspace, cache_file) as writer:
        # Parse each text line once and write the parsed example to the cache.
        for line in dataset:
            writer.write_example(text_parser.parse_line(line))
Then, to load the cache file and learn from the dataset:
workspace = vw.Workspace()

# Read the examples back from the binary cache; no text parser is needed here.
with open("data.cache", "rb") as cache_file:
    with vw.CacheFormatReader(workspace, cache_file) as reader:
        for example in reader:
            print(workspace.predict_then_learn_one(example))
0.0
0.0
1.0
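Because a cache only pays off when the same dataset is read more than once, a common pattern is to build it on the first run and reuse it afterwards. The sketch below is one possible way to do that with the standard library; `cache_is_fresh` and `ensure_cache` are hypothetical helpers, not part of vowpal_wabbit_next, and `build_cache` is assumed to be a function you supply that wraps the CacheFormatWriter loop shown above.

```python
import os


def cache_is_fresh(source_path, cache_path):
    """Return True if the cache exists and is at least as new as the source."""
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) >= os.path.getmtime(source_path)


def ensure_cache(source_path, cache_path, build_cache):
    """Rebuild the cache only when it is missing or older than the source.

    build_cache is a caller-supplied callable taking (source_path, cache_path).
    Returns True if the cache was rebuilt, False if it was reused.
    """
    if cache_is_fresh(source_path, cache_path):
        return False
    build_cache(source_path, cache_path)
    return True
```

A modification-time check cannot detect a change of VW version. Since the cache format can change between releases and is not portable, it is safer to delete existing cache files after upgrading or when moving to another machine.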