Cache format

The cache format is a binary format used by VW that is designed to be fast to read. It can change between releases and is not portable. It is useful when a dataset is processed more than once, because the text only has to be parsed the first time and later passes can read the faster binary form.

For example, the following table shows the time it takes to parse the first 10k lines of the RCV1 dataset.

Input    Time
Cache    11.8 ms ± 0.3 ms
Text     31.5 ms ± 1.0 ms
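A comparison like the one in the table can be gathered with a small timing harness. The sketch below is not how the numbers above were produced. The names `time_parse_ms` and `fake_parse` are hypothetical, and the workload is a stand-in so the code runs without VW or the RCV1 data. To compare real inputs, pass one function that parses the text file and another that iterates the cache file.

```python
import statistics
import time


def time_parse_ms(parse_all, repeats=10):
    """Call parse_all() `repeats` times; return (mean, stdev) of wall time in ms.

    `repeats` must be at least 2 so that a standard deviation can be computed.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse_all()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload (an assumption for illustration only): splits a fixed
# text line many times instead of parsing a real dataset.
def fake_parse():
    return [line.split() for line in ["0 | price:.23 sqft:.25"] * 1000]


mean_ms, stdev_ms = time_parse_ms(fake_parse)
print(f"{mean_ms:.3f} ms +- {stdev_ms:.3f} ms")
```

Results depend on the machine, the dataset, and the installed release, so measured values will likely differ from the table.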

There is support for both reading and writing the cache format. vowpal_wabbit_next.CacheFormatWriter can be used to create a cache file, and vowpal_wabbit_next.CacheFormatReader to read a cache file.

For example, to create a cache file of a sample dataset in VW text format:

import vowpal_wabbit_next as vw

dataset = [
    "0 | price:.23 sqft:.25 age:.05 2006",
    "1 | price:.18 sqft:.15 age:.35 1976",
    "0 | price:.53 sqft:.32 age:.87 1924",
]

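workspace = vw.Workspace()
text_parser = vw.TextFormatParser(workspace)

# Open the file in binary mode, since the cache is a binary format.
with open("data.cache", "wb") as cache_file:
    with vw.CacheFormatWriter(workspace, cache_file) as writer:
        for line in dataset:
            # Parse each text line into an example and write it to the cache.
            writer.write_example(text_parser.parse_line(line))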

Then, to load and learn from the dataset:

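workspace = vw.Workspace()
text_parser = vw.TextFormatParser(workspace)

with open("data.cache", "rb") as cache_file:
    with vw.CacheFormatReader(workspace, cache_file) as reader:
        # Iterating the reader yields one example at a time.
        for example in reader:
            # Print the prediction made before the model learns from the example.
            print(workspace.predict_then_learn_one(example))

Output:

0.0
0.0
1.0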
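Because the cache format can change between releases and is not portable, it can help to treat a cache file as something that is rebuilt on demand rather than kept. The following is a minimal sketch of that idea using only the standard library. The helper `ensure_cache`, the `.version` sidecar file, and the `version_tag` strings are all hypothetical and not part of vowpal_wabbit_next. The `build_cache` argument stands in for a function that runs a writer loop like the one shown earlier.

```python
import os
import tempfile


def ensure_cache(source_path, cache_path, version_tag, build_cache):
    """Rebuild cache_path with build_cache() unless it is already up to date.

    The cache is kept only if it exists, is at least as new as the source,
    and was written under the same version_tag. Returns True if rebuilt.
    """
    tag_path = cache_path + ".version"  # hypothetical sidecar file
    fresh = (
        os.path.exists(cache_path)
        and os.path.exists(tag_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if fresh:
        with open(tag_path, "r", encoding="utf-8") as tag_file:
            fresh = tag_file.read() == version_tag
    if fresh:
        return False
    build_cache(source_path, cache_path)
    with open(tag_path, "w", encoding="utf-8") as tag_file:
        tag_file.write(version_tag)
    return True


# Stand-in builder (an assumption for illustration): copies bytes instead of
# writing a real cache, so the sketch runs without VW installed.
def copy_builder(source_path, cache_path):
    with open(source_path, "rb") as src, open(cache_path, "wb") as dst:
        dst.write(src.read())


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "data.txt")
    cache = os.path.join(tmp, "data.cache")
    with open(source, "w", encoding="utf-8") as f:
        f.write("0 | price:.23 sqft:.25 age:.05 2006\n")
    # First call builds the cache, the second keeps it, and the third
    # rebuilds it because the version tag changed.
    print(ensure_cache(source, cache, "release-a", copy_builder))
    print(ensure_cache(source, cache, "release-a", copy_builder))
    print(ensure_cache(source, cache, "release-b", copy_builder))
```

The caller chooses `version_tag`. Any string that changes whenever the installed release changes would serve, which keeps the helper independent of how that release is identified.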