Experiments

Overview

Experiments (also known as A/B testing) let you test different workflow versions on live traffic to find the best-performing version.

While experiments take time to accumulate enough data, they complement backtesting in two ways:

  • If you don’t have enough historical data or the historical data doesn’t reflect new trends, you can run experiments instead of backtests.
  • Experiments help you measure performance accurately against ground truth. For example, a new version of a loan application workflow may approve more users, and you can measure the delinquency rate of that segment through experiments. That isn't possible in backtesting, since those users were never approved in the first place.

We recommend running backtests first and then running experiments.

Create an experiment

To create an experiment, you first need to select a bucketing feature. This is an input feature (e.g., user_id) used to split traffic into experiment groups. Only features of the String type can be used as the bucketing feature. Requests with the same bucketing feature value are always assigned to the same experiment group.

The control group should run the baseline workflow version, and treatment groups should run workflow versions that contain improvements. Each experiment group can receive between 1% and 99% of the traffic. We recommend an even split across experiment groups so that you reach statistical significance faster.
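
For intuition, the sketch below shows how deterministic bucketing can behave: a String bucketing feature (user_id here) is hashed into a stable bucket and mapped to a group according to the traffic split. The group names and percentages are made up, and the platform performs the real assignment for you; this is only a mental model, not the actual implementation.

  import hashlib

  # Hypothetical split: 50% control, 50% treatment (percentages must sum to 100).
  # The platform performs the real assignment; this only illustrates the behavior.
  GROUPS = [("control", 50), ("treatment", 50)]

  def assign_group(bucketing_value: str) -> str:
      """Map a String bucketing feature (e.g., user_id) to an experiment group."""
      digest = hashlib.sha256(bucketing_value.encode("utf-8")).hexdigest()
      bucket = int(digest, 16) % 100  # same value -> same bucket, every time
      threshold = 0
      for name, pct in GROUPS:
          threshold += pct
          if bucket < threshold:
              return name
      return GROUPS[-1][0]

  print(assign_group("user_123"))  # prints the same group on every call

Because the hash is stable, a given user keeps seeing the same workflow version for the lifetime of the experiment.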


Deploy an experiment

To deploy an experiment, create a new deployment and select the experiment you just created. The experiment will replace the currently deployed workflow version or experiment.


Conclude an experiment

While the experiment is running, the experiment ID and workflow version ID are returned in the workflow execution result. You can ingest the results into your data warehouse, join them with labels such as chargebacks and delinquency, and calculate metrics such as precision, recall, approval rate, and delinquency rate.
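
As a rough illustration, the pandas sketch below joins ingested execution results with outcome labels and computes per-version approval and delinquency rates. The table and column names (approved, delinquent, and so on) are assumptions about your own schema, not fields the platform defines.

  import pandas as pd

  # Assumed schemas; adjust column names to match your warehouse.
  executions = pd.DataFrame({
      "user_id": ["u1", "u2", "u3", "u4"],
      "experiment_id": ["exp_1"] * 4,           # returned in the execution result
      "workflow_version_id": ["v1", "v2", "v1", "v2"],
      "approved": [True, True, False, True],
  })
  labels = pd.DataFrame({                        # ground-truth outcomes
      "user_id": ["u1", "u2", "u4"],
      "delinquent": [False, True, False],
  })

  joined = executions.merge(labels, on="user_id", how="left")

  # Approval rate per workflow version.
  approval_rate = joined.groupby("workflow_version_id")["approved"].mean()

  # Delinquency rate among approved users that have a label.
  approved = joined[joined["approved"] & joined["delinquent"].notna()]
  delinquency_rate = approved.groupby("workflow_version_id")["delinquent"].mean()

  print(approval_rate)
  print(delinquency_rate)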

Note that you need enough data in each experiment group to reach statistical significance. For now, you need to calculate that yourself.
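
One common approach is a two-proportion z-test comparing a rate (for example, the approval rate) between the control group and a treatment group. The sketch below uses made-up counts; plug in your own per-group totals.

  import math

  def two_proportion_z_test(success_a, total_a, success_b, total_b):
      """Two-sided z-test for the difference between two proportions."""
      p_a, p_b = success_a / total_a, success_b / total_b
      pooled = (success_a + success_b) / (total_a + total_b)
      se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
      z = (p_a - p_b) / se
      # Two-sided p-value from the standard normal distribution.
      p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
      return z, p_value

  # Made-up example: approvals out of total requests per group.
  z, p = two_proportion_z_test(480, 1000,   # control
                               530, 1000)   # treatment
  print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05 suggests a significant difference

If the p-value is below your chosen threshold (commonly 0.05), the observed difference is unlikely to be due to chance alone.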

Once you see a statistically significant metric lift in a treatment group, you can deploy the workflow version of the best-performing treatment group.