Using vw-estimators

vw-estimators is a library of off-policy estimators for problems such as contextual bandits. These estimators can be used to evaluate a target policy against a logged contextual bandit dataset, and the library provides confidence bounds in addition to point estimators. In this example we process a trivial dataset and feed the results into an IPS estimator and a CressieRead confidence interval.
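Before walking through the code, the core idea behind the IPS (inverse propensity scoring) estimate can be sketched in a few lines of plain Python. This `ips_estimate` helper is illustrative only, not part of vw-estimators: it averages rewards reweighted by the ratio of the target policy's probability to the logged policy's probability for the chosen action.

```python
from typing import List, Tuple


def ips_estimate(events: List[Tuple[float, float, float]]) -> float:
    """Average importance-weighted reward over (p_logged, reward, p_target) events."""
    total = 0.0
    for p_logged, reward, p_target in events:
        # Reweight the observed reward by how much more (or less) likely the
        # target policy was to take the logged action than the logged policy.
        total += reward * (p_target / p_logged)
    return total / len(events)


# Logged policy chose each action with probability 0.5; the target policy
# puts probability 0.9 on the first logged action and 0.1 on the second.
print(ips_estimate([(0.5, 1.0, 0.9), (0.5, 0.0, 0.1)]))  # 0.9
```

The estimator used below accumulates exactly these three quantities per event, which is why the loop later extracts the logged probability, the cost (negated into a reward), and the target policy's probability.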

extract_label translates the contextual bandit label information from VW's internal representation into a more familiar form.

from typing import List, Optional, Tuple
import vowpal_wabbit_next as vw
from estimators.bandits import ips, cressieread


# VW's labels contain extra info, and are associated with each example.
# This function extracts the logical CB label from the example list.
# Assumes examples have CBLabel typed labels.
def extract_label(examples: List[vw.Example]) -> Optional[Tuple[int, float, float]]:
    first_is_shared = len(examples) > 0 and examples[0].get_label().shared
    for i, example in enumerate(examples):
        if (label := example.get_label().label) is not None:
            _, cost, prob = label
            return (i - (1 if first_is_shared else 0), cost, prob)
    return None
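The index arithmetic in extract_label is the subtle part: when the first example in the list is the shared context, the labelled line's position must be shifted down by one to yield a 0-based action index. The following is a plain-Python analogue of that loop (a sketch operating on a list of optional label tuples rather than vw.Example objects, so it can run without a workspace):

```python
from typing import List, Optional, Tuple


def to_action_index(
    labels: List[Optional[Tuple[int, float, float]]],
    first_is_shared: bool,
) -> Optional[Tuple[int, float, float]]:
    # Mirror of extract_label's loop: find the labelled line and shift its
    # position past the shared example to get a 0-based action index.
    for i, label in enumerate(labels):
        if label is not None:
            _, cost, prob = label
            return (i - (1 if first_is_shared else 0), cost, prob)
    return None


# Shared line first, then action 0 (labelled) and action 1:
print(to_action_index([None, (0, 1.0, 0.5), None], first_is_shared=True))  # (0, 1.0, 0.5)
```

Without a shared example there is no shift, so the labelled line's position is already the action index.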

We’ll use the following trivial input for this example. There are two actions, each identified by a single feature. We’re using an io.StringIO so we can treat this input as if it were being read from a file with a vowpal_wabbit_next.TextFormatReader.

import io

input = io.StringIO(
    """shared | s
0:1:0.5 | a=0
| a=1

shared | s
| a=0
1:0:0.5 | a=1

shared | s
0:1:0.5 | a=0
| a=1

shared | s
| a=0
1:0:0.5 | a=1"""
)
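Each labelled line in the input above carries a label of the form action:cost:probability. As a quick sanity check, here is a plain-Python decomposition of one such label (`parse_cb_label` is a hypothetical helper for illustration, not part of either library):

```python
from typing import Tuple


def parse_cb_label(label: str) -> Tuple[int, float, float]:
    """Split a textual contextual bandit label 'action:cost:probability' into parts."""
    action, cost, prob = label.split(":")
    return int(action), float(cost), float(prob)


# First labelled line of the input: action 0 incurred cost 1,
# and the logging policy chose it with probability 0.5.
print(parse_cb_label("0:1:0.5"))  # (0, 1.0, 0.5)
```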

See the comments below for a step-by-step explanation of the process.

workspace = vw.Workspace(["--cb_explore_adf"])
estimator = ips.Estimator()
interval = cressieread.Interval(empirical_r_bounds=True)

estimates = []
lower = []
upper = []

with vw.TextFormatReader(workspace, input) as reader:
    for event in reader:
        logged_label = extract_label(event)

        # 1. Check if this event is labelled, if not skip it
        if logged_label is None:
            continue

        # 2. Predict and learn on the event
        pmf = workspace.predict_then_learn_one(event)

        # 3. Extract the logged cost and the probability of choosing it according to the logged policy
        logged_action_0_based, logged_cost, logged_prob = logged_label

        # 4. Get the probability of choosing the logged action according to the target policy
        prediction_prob = next(x for i, x in pmf if i == logged_action_0_based)

        # 5. Feed these values into the estimator and confidence interval
        # Note: These operate with rewards so we multiply cost by -1 to convert to reward
        estimator.add_example(logged_prob, logged_cost * -1, prediction_prob)
        interval.add_example(logged_prob, logged_cost * -1, prediction_prob)

print(f"Estimate: {estimator.get()}")
bounds = interval.get()
print(f"Lower bound: {bounds[0]}")
print(f"Upper bound: {bounds[1]}")
Estimate: -0.2625000001862645
Lower bound: -1.0
Upper bound: 0.3219763298424875
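For intuition, the same IPS calculation can be done by hand on the four labelled events in the input. This sketch assumes a fixed uniform-random target policy (probability 0.5 for each action) instead of the policy learned above, so its result differs from the printed estimate; it also reproduces none of the library's internals.

```python
# (logged probability, cost) for the four labelled events in the input above
logged = [(0.5, 1.0), (0.5, 0.0), (0.5, 1.0), (0.5, 0.0)]
target_prob = 0.5  # assumed uniform-random target policy over two actions

# IPS over rewards (cost * -1), importance weight = target_prob / logged_prob.
# Here every weight is 1, so the estimate is just the average reward.
estimate = sum(-cost * (target_prob / p) for p, cost in logged) / len(logged)
print(estimate)  # -0.5
```

A policy that exactly matches the logger gets the plain average of the logged rewards; the learned policy above scores better than -0.5 because it shifts probability toward the zero-cost actions.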