Using vw-estimators
vw-estimators is a library of off-policy estimators for various problems, including contextual bandits. They can be used to evaluate target policies against a logged contextual bandit dataset. The library also includes confidence bounds alongside the estimators. In this example we process a trivial example dataset and feed the results into an IPS estimator and a CressieRead confidence interval.
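Before walking through the full example, it helps to see what the inverse propensity scoring (IPS) estimator computes. The sketch below illustrates the idea only and is not the library's implementation: each logged reward is reweighted by the ratio of the target policy's probability to the logging policy's probability for the logged action, and the reweighted rewards are averaged.

```python
from typing import List, Tuple


def ips_estimate(events: List[Tuple[float, float, float]]) -> float:
    """Average of reward * (p_target / p_log) over logged events.

    Each event is (p_log, reward, p_target) for the logged action.
    """
    return sum(r * p_tgt / p_log for p_log, r, p_tgt in events) / len(events)


# Two toy events: the target policy puts less weight than the logging
# policy on the action that was rewarded.
print(ips_estimate([(0.5, 1.0, 0.25), (0.5, 0.0, 0.75)]))  # 0.25
```

This is the same quantity the library's ips.Estimator accumulates via add_example, just written out by hand for two events.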
extract_label is a helper that translates VW's representation of the contextual bandit label, which is attached to the examples themselves, into a more familiar (action, cost, probability) form.
from typing import List, Optional, Tuple

import vowpal_wabbit_next as vw
from estimators.bandits import ips, cressieread


# VW's labels contain extra info and are associated with each example.
# This function extracts the logical CB label from the example list.
# Assumes the examples carry CBLabel typed labels.
def extract_label(examples: List[vw.Example]) -> Optional[Tuple[int, float, float]]:
    first_is_shared = len(examples) > 0 and examples[0].get_label().shared
    for i, example in enumerate(examples):
        if (label := example.get_label().label) is not None:
            _, cost, prob = label
            return (i - (1 if first_is_shared else 0), cost, prob)
    return None
We’ll use the following trivial input for this example. There are two actions, each identified by a single feature. Wrapping the text in an io.StringIO lets us treat it as if we were reading from a file with a vowpal_wabbit_next.TextFormatReader.
import io
input = io.StringIO(
"""shared | s
0:1:0.5 | a=0
| a=1
shared | s
| a=0
1:0:0.5 | a=1
shared | s
0:1:0.5 | a=0
| a=1
shared | s
| a=0
1:0:0.5 | a=1"""
)
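A quick note on the label syntax: in a line like 0:1:0.5, the three fields are action:cost:probability — a nominal action id (which extract_label discards), the observed cost, and the probability with which the logging policy chose that action. Action lines without a label were simply not chosen. Parsing one by hand:

```python
# "0:1:0.5" means: nominal action id 0, observed cost 1, and the logging
# policy chose this action with probability 0.5.
action_id, cost, prob = (float(x) for x in "0:1:0.5".split(":"))
print(action_id, cost, prob)  # 0.0 1.0 0.5
```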
See comments for an explanation of the process.
workspace = vw.Workspace(["--cb_explore_adf"])
estimator = ips.Estimator()
interval = cressieread.Interval(empirical_r_bounds=True)

with vw.TextFormatReader(workspace, input) as reader:
    for event in reader:
        logged_label = extract_label(event)
        # 1. Check if this event is labelled; if not, skip it
        if logged_label is None:
            continue
        # 2. Predict and learn on the event
        pmf = workspace.predict_then_learn_one(event)
        # 3. Extract the logged cost and the probability of choosing it according to the logging policy
        logged_action_0_based, logged_cost, logged_prob = logged_label
        # 4. Get the probability of choosing the logged action according to the target policy
        prediction_prob = next(x for i, x in pmf if i == logged_action_0_based)
        # 5. Feed these values into the estimator and confidence interval
        # Note: these operate on rewards, so we multiply cost by -1 to convert
        estimator.add_example(logged_prob, logged_cost * -1, prediction_prob)
        interval.add_example(logged_prob, logged_cost * -1, prediction_prob)

print(f"Estimate: {estimator.get()}")
bounds = interval.get()
print(f"Lower bound: {bounds[0]}")
print(f"Upper bound: {bounds[1]}")
Estimate: -0.2625000001862645
Lower bound: -1.0
Upper bound: 0.3219763298424875
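As a rough sanity check, consider what IPS would report for a fixed uniform-random target policy (an illustrative assumption; the run above evaluates the policy the workspace is learning). The four labelled events have costs [1, 0, 1, 0], each logged with probability 0.5, and costs become rewards by multiplying by -1:

```python
# (p_log, reward) for the four labelled events; rewards are costs * -1.
logged = [(0.5, -1.0), (0.5, 0.0), (0.5, -1.0), (0.5, 0.0)]
p_target = 0.5  # uniform over two actions (illustrative assumption)
estimate = sum(r * p_target / p_log for p_log, r in logged) / len(logged)
print(estimate)  # -0.5
```

The learned policy's estimate above (about -0.26) is higher than this -0.5 baseline, consistent with the workspace shifting probability toward the zero-cost actions as it learns.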