Cache format

The cache format is a binary format used by VW which is aimed at being fast to read. It can change between releases and is not portable. It is a useful tool to speed things up when a dataset may be processed more than once.
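Because the cache is not portable and can change between releases, one cautious pattern is to treat it as a disposable artifact that is rebuilt from the source data whenever it is missing or stale. The sketch below uses only the Python standard library. `cache_is_fresh` and its path arguments are hypothetical names chosen for illustration and are not part of vowpal_wabbit_next.

```python
import os


def cache_is_fresh(source_path: str, cache_path: str) -> bool:
    """Return True if cache_path exists and is at least as new as source_path.

    Hypothetical helper: it only compares modification times, so it cannot
    detect a cache that was written by a different VW release.
    """
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
```

A caller would rebuild the cache whenever this returns False. Since modification times cannot reveal a format change, it may also be worth including the installed package version in the cache file name.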

For example, the following table shows the time it takes to parse the first 10k lines of the RCV1 dataset in each format.

Format    Time
Cache     11.8 ms +- 0.3 ms
Text      31.5 ms +- 1.0 ms
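Timings reported as a mean plus or minus a spread can be collected with a small harness like the one below. This is a minimal sketch using only the standard library, not the benchmark that produced the numbers above. `time_ms` is a hypothetical helper, and the workload passed to it is a stand-in that should be replaced with a function that parses the dataset.

```python
import statistics
import time
from typing import Callable, Tuple


def time_ms(fn: Callable[[], object], repeats: int = 10) -> Tuple[float, float]:
    """Run fn `repeats` times and return (mean, standard deviation) in ms."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to compute a deviation")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; replace with a function that parses the dataset.
mean_ms, std_ms = time_ms(lambda: sum(range(10_000)))
print(f"{mean_ms:.1f} ms +- {std_ms:.1f} ms")
```

Running the same harness once over a text parsing function and once over a cache reading function gives two directly comparable figures.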

There is support for both reading and writing the cache format. vowpal_wabbit_next.CacheFormatWriter can be used to create a cache file, and vowpal_wabbit_next.CacheFormatReader to read a cache file.

For example, to create a cache file of a sample dataset in VW text format:

import vowpal_wabbit_next as vw

dataset = [
    "0 | price:.23 sqft:.25 age:.05 2006",
    "1 | price:.18 sqft:.15 age:.35 1976",
    "0 | price:.53 sqft:.32 age:.87 1924",
]

workspace = vw.Workspace([])
text_parser = vw.TextFormatParser(workspace)

with open("data.cache", "wb") as cache_file:
    with vw.CacheFormatWriter(workspace, cache_file) as writer:
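        for line in dataset:
            # Parse the text line into an example and append it to the cache
            writer.write_example(text_parser.parse_line(line))
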
Then, to load the cache file and learn from the dataset:

workspace = vw.Workspace([])
text_parser = vw.TextFormatParser(workspace)

with open("data.cache", "rb") as cache_file:
    with vw.CacheFormatReader(workspace, cache_file) as reader:
        for example in reader:
            # Update the model with each example read from the cache
            workspace.learn_one(example)