Backtesting

Overview

Deploying a new workflow version is scary: how do you know the new version performs better than the old one? After you’ve run a sanity test to rule out obvious problems, backtesting can help you test the workflow against a dataset to gain more confidence in the new version.

Prepare the dataset

A dataset contains the historical data you’ll use to test a workflow version. You can use either executions or a CSV file as the dataset.

To join the test results with the labels in your data warehouse, we suggest including an input feature that can serve as the join key. This is usually an ID, such as an application ID. Make sure the join key is defined as an input feature of the workflow.

Backtesting with executions

To run backtests with executions, simply select a date range of executions. Sperta supports large-scale backtesting with hundreds of millions of executions.

This method supports workflows that contain data sources. However, an execution will be skipped if it doesn’t contain the data source response the current backtest needs.

In the following example, if a new workflow version approves more applications in the Knockout Rules stage, the backtest for this version may skip the executions that were declined in the Knockout Rules stage, because those executions don’t have the credit data. When this happens, the status column in the backtesting result will be InvalidArgument.

image

Backtesting with CSV

You can’t backtest workflows containing data sources using CSV files.

When you first start using Sperta, there may not be enough executions to backtest against. In that case, you can upload your own dataset as a CSV file.

A CSV file looks like this:

email_domain,fraud_score,credit_score,age,past_due_amount
hotmail.com,0.2,721,35,100.0
gmail.com,0.6,675,45,12.34
yahoo.com,0.4,801,28,0.0

The first line of the CSV contains the input feature IDs separated by commas. Each following line represents the feature values of a sample, such as an application or transaction. We suggest exporting the CSV from your data warehouse, and there are a few details to pay attention to:

  • For boolean values, we support the following formats: true, false, TRUE, and FALSE. true and false are natively supported in the Sperta Expression Language. TRUE and FALSE are also supported because some spreadsheet software and data warehouses automatically convert boolean values to this format.
  • CSV doesn’t support complex feature types (List, Person, etc.). Don’t add them to the CSV even if the workflow contains such features. Sperta will automatically supply an empty list for List features during the backtest.
  • The maximum file size allowed is currently 1 MB. As a rule of thumb, a 1 MB CSV file can contain 30K rows with dozens of input features (columns).
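The points above can be sketched as a small export helper. This is a minimal illustration in Python, not a Sperta tool: the function names, the `is_returning` and `tags` features, and the reading of "1 MB" as 1,000,000 bytes are all assumptions made for the example.

```python
import csv
import io

# Assumption: "1 MB" is read conservatively as 1,000,000 bytes.
MAX_BYTES = 1_000_000


def to_cell(value):
    """Render one feature value as CSV text."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_backtest_csv(samples):
    """Build CSV text from a list of dicts keyed by input feature ID."""
    # Keep only columns whose values are all primitive; lists and other
    # complex values are left out of the file entirely.
    primitive = (bool, int, float, str)
    columns = [
        key for key in samples[0]
        if all(isinstance(sample[key], primitive) for sample in samples)
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # first line: input feature IDs
    for sample in samples:
        writer.writerow([to_cell(sample[column]) for column in columns])
    text = buffer.getvalue()
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("CSV exceeds the assumed 1 MB upload limit")
    return text


# Hypothetical samples; "tags" is a list feature and is dropped.
samples = [
    {"email_domain": "hotmail.com", "fraud_score": 0.2, "is_returning": True, "tags": ["a"]},
    {"email_domain": "gmail.com", "fraud_score": 0.6, "is_returning": False, "tags": []},
]
csv_text = build_backtest_csv(samples)
print(csv_text)
```

In practice the rows would come from a data warehouse export rather than an in-memory list, but the same checks apply before uploading.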

Analyze the test result

After the backtest finishes, you can download the test result as a CSV file:

image

The results are appended as extra columns to the input columns, so the downloaded CSV looks like this:

email_domain,fraud_score,credit_score,age,past_due_amount,status,decision,blocked_country,email_domain_in_block_list,loan_outcomes.apr,loan_outcomes.loan_amount
hotmail.com,0.2,721,35,100.0,OK,Approve,false,false,0.1,10000
gmail.com,0.6,675,45,12.34,OK,Approve,false,false,0.15,8000
yahoo.com,0.4,801,28,0.0,OK,Approve,false,false,0.1,10000

Specifically:

  • status indicates whether the current sample was successfully backtested. As with workflow executions, errors can occur for various reasons, such as a feature type mismatch. Discard the sample if the status is not OK.
  • decision is the decision of the workflow.
  • The columns after decision are outputs of the workflow. In this example, they are blocked_country, email_domain_in_block_list, loan_outcomes.apr, and loan_outcomes.loan_amount.
  • For input features, only features of primitive types (Boolean, String, Integer, Double) are included, even if they’re available in executions.
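As a rough illustration of the column layout described above, the following Python sketch reads a result file, keeps only the samples whose status is OK, and treats every column after decision as a workflow output. The helper name is hypothetical, and the shape of the InvalidArgument row (empty decision and output cells) is an assumption; the actual file may differ.

```python
import csv
import io

# Sample result text. The second row is a made-up failed sample; how
# Sperta fills the remaining cells for such rows is an assumption.
result_csv = """email_domain,fraud_score,credit_score,age,past_due_amount,status,decision,blocked_country,email_domain_in_block_list,loan_outcomes.apr,loan_outcomes.loan_amount
hotmail.com,0.2,721,35,100.0,OK,Approve,false,false,0.1,10000
gmail.com,0.6,675,45,12.34,InvalidArgument,,,,,
yahoo.com,0.4,801,28,0.0,OK,Approve,false,false,0.1,10000
"""


def load_ok_rows(text):
    """Return (rows with status OK, names of the workflow output columns)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames
    # Everything after the decision column is a workflow output.
    output_columns = header[header.index("decision") + 1:]
    ok_rows = [row for row in reader if row["status"] == "OK"]
    return ok_rows, output_columns


ok_rows, output_columns = load_ok_rows(result_csv)
print(len(ok_rows), output_columns)
```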

You can import the CSV to your data warehouse or BI tools, join it with labels, and compute metrics such as precision, recall, approval rate, and delinquency rate.
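If a data warehouse or BI tool isn't at hand, the same join and metrics can be sketched in plain Python. Everything specific here is an assumption for illustration: the application_id join key, the delinquent label column, the Decline decision value, and the choice to treat a declined application as a positive prediction of delinquency.

```python
import csv
import io

# Hypothetical backtest results and labels, joined on application_id.
results_csv = """application_id,status,decision
a1,OK,Approve
a2,OK,Decline
a3,OK,Approve
a4,OK,Decline
a5,InvalidArgument,
"""
labels_csv = """application_id,delinquent
a1,false
a2,true
a3,true
a4,false
a5,true
"""


def compute_metrics(results_text, labels_text):
    """Join results with labels and compute basic decision metrics."""
    labels = {
        row["application_id"]: row["delinquent"] == "true"
        for row in csv.DictReader(io.StringIO(labels_text))
    }
    # Keep only successfully backtested samples that have a label.
    joined = [
        (row["decision"], labels[row["application_id"]])
        for row in csv.DictReader(io.StringIO(results_text))
        if row["status"] == "OK" and row["application_id"] in labels
    ]
    approved = [bad for decision, bad in joined if decision == "Approve"]
    declined = [bad for decision, bad in joined if decision == "Decline"]
    total_bad = sum(bad for _, bad in joined)
    return {
        # Share of samples the workflow approved.
        "approval_rate": len(approved) / len(joined) if joined else 0.0,
        # Share of approved samples that turned out delinquent.
        "delinquency_rate": sum(approved) / len(approved) if approved else 0.0,
        # Of the declined samples, how many were truly delinquent.
        "precision": sum(declined) / len(declined) if declined else 0.0,
        # Of the truly delinquent samples, how many were declined.
        "recall": sum(declined) / total_bad if total_bad else 0.0,
    }


metrics = compute_metrics(results_csv, labels_csv)
print(metrics)
```

Running the same function on the results of the old and new workflow versions gives a side-by-side comparison on identical samples.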