Generative and Predictive AI Accelerators | DataRobot AI Platform
https://www.datarobot.com/ai-accelerators/

Object Classification on Video with DataRobot Visual AI
https://www.datarobot.com/ai-accelerators/object-classification-on-video-with-datarobot-visual-ai/
Wed, 28 Feb 2024

This AI Accelerator demonstrates how a deep learning model trained and deployed with the DataRobot platform can be used for object detection on a video stream (detecting whether the person in front of the camera is wearing glasses).

DataRobot Visual AI allows you to train deep learning models for the computer vision projects demanded across many industries. Object detection (binary and multiclass classification) applied to image and video processing is one of the tasks that can be implemented easily and efficiently with DataRobot Visual AI. You can also bring your own computer vision model and deploy it in DataRobot via the Custom Model Workshop.

This AI Accelerator demonstrates how a deep learning model trained and deployed with the DataRobot platform can be used for object detection on a video stream (detecting whether the person in front of the camera is wearing glasses). An Elastic-Net Classifier (L2 / Binomial Deviance) together with a Pretrained MobileNetV3-Small-Pruned Multi-Level Global Average Pooling Image Featurizer, with no image augmentation, is used for this accelerator. The dataset used for training can be found here; it contains images for two classes: people with glasses and people without glasses. The full size of the original dataset is 13.3 GB, and the cropped dataset is 335.8 MB. A small sample of the cropped dataset (100 images for each class) is used for this AI Accelerator. The video stream is captured with the OpenCV computer vision library, and the frontend is implemented as a Streamlit application. Users of this AI Accelerator are expected to be familiar with training and deployment in DataRobot.

Accelerator overview

This accelerator requires:

The following steps outline the accelerator workflow.

  1. Install OpenCV (pip install opencv-python).
  2. Install Streamlit (pip install streamlit).
  3. Train and deploy a deep learning model with DataRobot.
  4. Implement the Streamlit app (the source code can be found in this repository; a minimal sketch is shown below).
  5. Run the application: streamlit run app.py -- --deployment_id DEPLOYMENT_ID
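The full application source lives in the repository referenced in step 4; the following is only a minimal sketch of what such an app might look like. It assumes a webcam at index 0, a Visual AI deployment whose image feature is named "image", and placeholder values for the prediction server URL, API token, and DataRobot key, so details may differ from the accelerator's actual code.

import argparse
import base64

import cv2
import requests
import streamlit as st

# Parse the deployment ID passed after "--" on the streamlit command line
parser = argparse.ArgumentParser()
parser.add_argument("--deployment_id", required=True)
args, _ = parser.parse_known_args()

# Placeholders -- substitute your own prediction server URL and credentials
PREDICTION_URL = "https://example.orm.datarobot.com/predApi/v1.0/deployments/{}/predictions"
API_TOKEN = "YOUR_API_TOKEN"
DATAROBOT_KEY = "YOUR_DATAROBOT_KEY"  # required on managed cloud only

st.title("Glasses detection demo")
frame_slot = st.empty()
label_slot = st.empty()

cap = cv2.VideoCapture(0)  # default webcam
try:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        # Visual AI expects images as base64-encoded strings
        _, jpeg = cv2.imencode(".jpg", frame)
        row = {"image": base64.b64encode(jpeg.tobytes()).decode("utf-8")}

        # Score the frame against the deployment (simplified: one request per frame);
        # response handling follows the DataRobot real-time prediction API format
        response = requests.post(
            PREDICTION_URL.format(args.deployment_id),
            json=[row],
            headers={
                "Authorization": f"Bearer {API_TOKEN}",
                "DataRobot-Key": DATAROBOT_KEY,
                "Content-Type": "application/json",
            },
        )
        prediction = response.json()["data"][0]["prediction"]

        # Show the current frame and the predicted class
        frame_slot.image(frame, channels="BGR")
        label_slot.write(f"Predicted class: {prediction}")
finally:
    cap.release()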
Prediction Intervals via Conformal Inference
https://www.datarobot.com/ai-accelerators/prediction-intervals-via-conformal-inference/
Wed, 28 Feb 2024

This AI Accelerator demonstrates various ways of generating prediction intervals for any DataRobot model. The methods presented here are rooted in the area of conformal inference (also known as conformal prediction).

These types of approaches have become increasingly popular for uncertainty quantification because they do not require strict distributional assumptions to be met in order to achieve proper coverage (i.e., they only require that the testing data is exchangeable with the training data). While conformal inference can be applied across a wide array of prediction problems, the focus in this notebook will be prediction interval generation on regression targets. This notebook is formatted as follows:

  1. Importing libraries
  2. Notebook parameters and helper functions
  3. Loading the example dataset
  4. Model building and making predictions
  5. Method 1: Absolute conformal
  6. Method 2: Signed conformal
  7. Method 3: Locally-weighted conformal
  8. Method 4: Conformalized quantile regression
  9. Comparing methods
  10. Conclusion

Note: the particulars for each method have been simplified (e.g., the authors use an "adjusted" quantile level rather than the traditional quantile calculation implemented here). For a full treatment of each approach and specific algorithm details, see the cited reference papers below.
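For intuition only, here is a small sketch (not used elsewhere in this notebook) of the difference between the traditional empirical quantile used below and the finite-sample "adjusted" level that commonly appears in the conformal inference literature; the scores array is a made-up placeholder.

import numpy as np

alpha = 0.1  # corresponds to a 90% coverage level
scores = np.random.default_rng(0).exponential(size=500)  # placeholder conformity scores
n = len(scores)

# Traditional empirical quantile (what this notebook uses)
traditional_q = np.quantile(scores, 1 - alpha)

# "Adjusted" level, (1 - alpha) * (1 + 1/n) capped at 1 -- one common finite-sample correction
adjusted_q = np.quantile(scores, min((1 - alpha) * (1 + 1 / n), 1.0))

print(traditional_q, adjusted_q)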

1. Importing libraries

Read about different options for connecting to DataRobot from the client. Load the remaining libraries below in the usual way.

In [1]:

# Establish connection
import datarobot as dr

print(f"DataRobot version: {dr.__version__}")
dr.Client()

DataRobot version: 3.2.0

Out [1]:

<datarobot.rest.RESTClientObject at 0x10c74f970>
In [2]:

# Imports
import datarobot as dr
from datarobot.models.modeljob import wait_for_async_model_creation
import numpy as np
import pandas as pd

2. Notebook parameters and helper functions

Below, you’ll have two parameters for this notebook:

  1. COVERAGE_LEVEL: fraction of prediction intervals that should contain the target
  2. TEST_DATA_FRACTION: fraction of data to hold out and use as a testing dataset to evaluate each method

In addition, a couple of helper functions are provided to make the following analysis easier. One of them computes the two metrics that will be used for comparison:

  1. Coverage: fraction of computed prediction intervals that contain the target across all rows
  2. Average Width: the average width of the prediction intervals across all rows

A desirable method should achieve the proper coverage at the smallest width possible.

In [3]:

# Coverage level
COVERAGE_LEVEL = 0.9

# Fraction of data to use to evaluate each method
TEST_DATA_FRACTION = 0.2
In [4]:

# Courtesy of https://github.com/yromano/cqr/blob/master/cqr/helper.py


def compute_coverage(
    y_test: np.array,
    y_lower: np.array,
    y_upper: np.array,
    significance: float,
    name: str = "",
) -> (float, float):
    """
    Computes coverage and average width

    Parameters
    ----------
    y_test: true labels (n)
    y_lower: estimated lower bound for the labels (n)
    y_upper: estimated upper bound for the labels (n)
    significance: desired significance level
    name: optional output string (e.g. the method name)

    Returns
    -------
    coverage : average coverage
    avg_width : average width

    """
    # Compute coverage
    in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper))
    coverage = in_the_range / len(y_test) * 100
    print(
        "%s: Percentage in the range (expecting %.2f): %f"
        % (name, 100 - significance * 100, coverage)
    )

    # Compute average width
    avg_width = np.mean(abs(y_upper - y_lower))
    print("%s: Average width: %f" % (name, avg_width))

    return coverage, avg_width


def compute_training_predictions(model: dr.Model) -> pd.DataFrame:
    """
    Computes (or gathers) the out-of-sample training predictions from a model

    Parameters
    ----------
    model: DataRobot model

    Returns
    -------
    DataFrame of training predictions

    """

    # Get project to unlock holdout
    project = dr.Project.get(model.project_id)
    project.unlock_holdout()

    # Request or gather predictions
    try:
        training_predict_job = model.request_training_predictions(
            dr.enums.DATA_SUBSET.ALL
        )
        training_predictions = training_predict_job.get_result_when_complete()

    except dr.errors.ClientError:
        training_predictions = [
            tp
            for tp in dr.TrainingPredictions.list(project.id)
            if tp.model_id == model.id and tp.data_subset == "all"
        ][0]

    return training_predictions.get_all_as_dataframe()


def quantile_rearrangement(
    test_preds: pd.DataFrame,
    quantile_low: float,
    quantile_high: float,
) -> pd.DataFrame:
    """
    Produces monotonic quantiles
    Based on: https://github.com/yromano/cqr/blob/master/cqr/torch_models.py#L66-#L94

    Parameters
    ----------
    test_preds: dataframe of quantile predictions to rearrange, sorted from lowest quantile to highest
    quantile_low: desired low quantile in the range (0,1)
    quantile_high: desired high quantile in the range (0,1)

    Returns
    -------
    Dataframe of rearranged quantile predictions

    References
    ----------
    .. [1]  Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon.
            "Quantile and probability curves without crossing."
            Econometrica 78.3 (2010): 1093-1125.
    """

    # Based on the code in the referenced function, "all_quantiles" is defined as the following:
    # See https://github.com/yromano/cqr/blob/master/cqr/helper.py#L423
    all_quantiles = np.linspace(0.01, 0.99, 99)

    # This part remains the same
    scaling = all_quantiles[-1] - all_quantiles[0]
    low_val = (quantile_low - all_quantiles[0]) / scaling
    high_val = (quantile_high - all_quantiles[0]) / scaling

    # Get new values
    q_fixed = np.quantile(
        test_preds.values, (low_val, high_val), method="linear", axis=1
    )

    return pd.DataFrame(q_fixed.T, columns=test_preds.columns)

3. Loading the example dataset

The dataset you’ll use comes from this DataRobot blog post. Each row represents a player in the National Basketball Association (NBA) and the columns signify different NBA statistics from various repositories, fantasy basketball news sources, and betting information. The target, game_score, is a single statistic that attempts to quantify player performance and productivity.

Additionally, you’ll partition the data into a training and testing sets. The training set will be used for modeling building / evaluation while the testing set will be used to compare each method.

In [5]:

# Load data
df = pd.read_csv(
    "https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_NBA_2017-2018.csv"
)
df.head()

Out [5]:

(df.head(), shown transposed: features as rows, the first five rows as columns 0-4; selected columns)

feature                            | 0         | 1         | 2          | 3          | 4
roto_fpts_per_min                  | NaN       | NaN       | NaN        | NaN        | NaN
roto_minutes                       | NaN       | NaN       | NaN        | NaN        | NaN
roto_fpts                          | NaN       | NaN       | NaN        | NaN        | NaN
roto_value                         | NaN       | NaN       | NaN        | NaN        | NaN
free_throws_lag30_mean             | NaN       | 0         | 1.5        | 2          | 2.5
field_goals_decay1_mean            | NaN       | 6         | 8          | 6.285714   | 7.2
game_score_lag30_mean              | NaN       | 8.1       | 13.35      | 10.7       | 12.15
minutes_played_decay1_mean         | NaN       | 27.066667 | 35.511111  | 33.380952  | 30.955556
PF_lastseason                      | 3.1       | 3.1       | 3.1        | 3.1        | 3.1
free_throws_attempted_lag30_mean   | NaN       | 0         | 2.5        | 3          | 3.25
team                               | PHO       | PHO       | PHO        | PHO        | PHO
opponent                           | POR       | LAL       | LAC        | SAC        | UTA
over_under                         | NaN       | NaN       | NaN        | NaN        | NaN
eff_field_goal_percent_lastseason  | 0.475     | 0.475     | 0.475      | 0.475      | 0.475
spread_decay1_mean                 | NaN       | -48       | -17.333333 | -31.428571 | -13.6
position                           | SG        | SG        | SG         | SG         | SG
OWS_lastseason                     | 1.3       | 1.3       | 1.3        | 1.3        | 1.3
free_throws_percent_decay1         | NaN       | NaN       | 0.6        | 0.692308   | 0.862069
text_yesterday_and_today           | NaN       | NaN       | NaN        | NaN        | NaN
game_score                         | 8.1       | 18.6      | 5.4        | 16.5       | 8.4
In [6]:

# Distribution of target
target_column = "game_score"
df[target_column].hist()
Out [6]:

<Axes: >
[Histogram of the target column game_score]
In [7]:
# Split data
df_train = df.sample(frac=1 - TEST_DATA_FRACTION, replace=False, random_state=10)
df_test = df.loc[~df.index.isin(df_train.index)]
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
print(df_train.shape)
print(df_test.shape)

(7999, 52)

(2000, 52)

4. Model building and making predictions

To create a DataRobot project and start building models, you can use the convenient Project.start function, which chains together project creation, file upload, and target selection. Once models are finished training, you'll retrieve the one DataRobot recommends for deployment and request predictions for both the training and testing sets. Note that the predictions made on the training dataset are not in-sample, but rather out-of-sample (i.e., also referred to as stacked predictions). These out-of-sample training predictions are a key component of each prediction interval method discussed in this notebook.

In [8]:

# Starting main project
project = dr.Project.start(
    sourcedata=df_train,
    project_name="Conformal Inference AIA - NBA",
    target=target_column,
    worker_count=-1,
)

# Wait
project.wait_for_autopilot(check_interval=120)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 4s)
In progress: 8, queued: 0 (waited: 6s)
In progress: 8, queued: 0 (waited: 10s)
In progress: 8, queued: 0 (waited: 17s)
In progress: 8, queued: 0 (waited: 30s)
In progress: 7, queued: 0 (waited: 56s)
In progress: 1, queued: 0 (waited: 108s)
In progress: 4, queued: 0 (waited: 211s)
In progress: 1, queued: 0 (waited: 332s)
In progress: 0, queued: 0 (waited: 453s)
In progress: 0, queued: 0 (waited: 574s)
In [9]:

# Get recommended model
best_model = dr.ModelRecommendation.get(project.id).get_model()
best_model
Out [9]:

Model('RandomForest Regressor')
In [10]:

# Compute training predictions (necessary for each method)
training_preds = compute_training_predictions(model=best_model)
training_preds.head()

Out [10]:

row_id | partition_id | prediction
0      | 0.0          | 0.899231
1      | Holdout      | 9.240274
2      | 0.0          | 15.043815
3      | 4.0          | 8.626567
4      | 2.0          | 15.435130
In [11]:

# Request predictions on testing data
pred_dataset = project.upload_dataset(sourcedata=df_test, max_wait=60 * 60 * 24)
predict_job = best_model.request_predictions(dataset_id=pred_dataset.id)
testing_preds = predict_job.get_result_when_complete(max_wait=60 * 60 * 24)
testing_preds.head()

Out [11]:

row_id | prediction
0      | 14.275717
1      | 13.238045
2      | 12.827469
3      | 14.141054
4      | 7.113611
In [12]:

# Join predictions training and testing datasets
df_train = df_train.join(training_preds.set_index("row_id"))
df_test = df_test.join(testing_preds.set_index("row_id"))
display(df_train[[target_column, "prediction"]])
display(df_test[[target_column, "prediction"]])

(training set)
      | game_score | prediction
0     | 0.0        | 0.899231
1     | 0.0        | 9.240274
2     | 21.6       | 15.043815
3     | 4.4        | 8.626567
4     | 26.7       | 15.435130
...
7994  | 18.4       | 19.801230
7995  | 12.9       | 10.349299
7996  | 19.3       | 14.453104
7997  | 9.8        | 23.360390
7998  | 8.1        | 9.220965
7999 rows × 2 columns

(testing set)
      | game_score | prediction
0     | 5.4        | 14.275717
1     | 16.5       | 13.238045
2     | 7.2        | 12.827469
3     | 23.8       | 14.141054
4     | 0.0        | 7.113611
...
1995  | 17.0       | 10.358305
1996  | 25.0       | 16.455246
1997  | -1.2       | 4.356278
1998  | 15.7       | 14.503841
1999  | 14.3       | 10.568885
2000 rows × 2 columns
In [13]:

# Compute the residuals on the training data
df_train["residuals"] = df_train[target_column] - df_train["prediction"]
df_train["residuals"].hist()

Out[13]:

<Axes: >


[Histogram of the out-of-sample training residuals]

5. Method 1: Absolute conformal

The first method you’ll implement, regarded here as “absolute conformal”, is as follows:

  1. Take the absolute value of the out-of-sample residuals (these will be the conformity scores)
  2. Compute the quantile associated with the specified COVERAGE_LEVEL on the conformity scores
  3. Add and subtract this quantile value to the prediction

The resulting prediction intervals are guaranteed to be symmetric and the same width (since you’re simply applying a scalar value across all rows). For more information regarding this approach, see Section 2.3.
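In symbols, the steps above produce, for each row, the interval [ŷ − q, ŷ + q], where ŷ is the model's prediction and q is the COVERAGE_LEVEL quantile of the absolute out-of-sample residuals |y − ŷ|.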

In [14]:

# Compute the conformity scores and take the quantile

df_train["abs_residuals"] = df_train["residuals"].abs()
abs_residuals_q = df_train["abs_residuals"].quantile(COVERAGE_LEVEL)
abs_residuals_q

Out [14]:

11.028431310477108

In [15]:

# Using the conformity score, create the prediction intervals
df_test["method_1_lower"] = df_test["prediction"] - abs_residuals_q
df_test["method_1_upper"] = df_test["prediction"] + abs_residuals_q
In [16]:

# Compute metrics
method_1_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_1_lower"].values,
    y_upper=df_test["method_1_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Absolute Conformal",
)
Absolute Conformal: Percentage in the range (expecting 90.00): 89.000000
Absolute Conformal: Average width: 22.056863

6. Method 2: Signed conformal

“Signed conformal” follows a very similar procedure as the previous one:

  1. Compute lower and upper quantile levels based on the specified COVERAGE_LEVEL
  2. Apply these quantile levels to the out-of-sample residuals (i.e., conformity scores)
  3. Add these quantile values to the prediction

The main advantage to this approach over the previous one is that the prediction intervals are not forced to be symmetric, which can lead to better coverage for skewed targets. For more information regarding this approach, see Section 3.2.
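In symbols, the steps above produce the interval [ŷ + q_low, ŷ + q_high], where q_low and q_high are the (1 − COVERAGE_LEVEL)/2 and 1 − (1 − COVERAGE_LEVEL)/2 quantiles of the out-of-sample residuals y − ŷ.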

In [17]:

# Compute lower and upper quantile levels to use based on the coverage
lower_coverage_q = round((1 - COVERAGE_LEVEL) / 2, 2)
upper_coverage_q = COVERAGE_LEVEL + (1 - COVERAGE_LEVEL) / 2
lower_coverage_q, upper_coverage_q

Out [17]:

(0.05, 0.95)

In [18]:

# Compute quantiles on the conformity scores
residuals_q_low = df_train["residuals"].quantile(lower_coverage_q)
residuals_q_high = df_train["residuals"].quantile(upper_coverage_q)
residuals_q_low, residuals_q_high

Out [18]:

(-10.573999229291612, 11.617478915155703)

In [19]:

# Using the quantile levels, create the prediction intervals
df_test["method_2_lower"] = df_test["prediction"] + residuals_q_low
df_test["method_2_upper"] = df_test["prediction"] + residuals_q_high
In [20]:

# Compute coverage / width
method_2_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_2_lower"].values,
    y_upper=df_test["method_2_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Signed Conformal",
)

Signed Conformal: Percentage in the range (expecting 90.00): 88.900000

Signed Conformal: Average width: 22.191478

7. Method 3: Locally-weighted conformal

While the primary advantage of the previous two methods is their simplicity, the disadvantage is that each prediction interval ends up being the exact same width. In many cases, it’s desirable to have varying widths that reflect the degree of confidence (i.e., harder to predict rows get a larger prediction interval and vice versa). To this end, you can make them more adaptive by using an auxiliary model to help augment the width on a per-row basis, depending on how much error we’d expect to see in a particular row. The “locally-weighted conformal” method is as follows:

  1. Take the absolute value of the out-of-sample residuals
  2. Build a model that regresses against the absolute residuals using the same feature set
  3. Compute the out-of-sample predictions from the absolute residuals model
  4. Scale the out-of-sample residuals using the auxiliary model's predictions to create the conformity scores
  5. Compute the quantile associated with the specified COVERAGE_LEVEL on the conformity scores
  6. Multiply this quantile value and the auxiliary model's predictions together (this will result in a locally-weighted offset to apply, specific to each row)
  7. Add and subtract this multiplied value to each prediction

Although this approach is more involved, it addresses the disadvantage above by making the prediction intervals adaptive to each row while still being symmetric. For more information regarding this approach, see Section 5.2.
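In symbols, with σ̂(x) denoting the auxiliary model's predicted absolute residual for a row and q the COVERAGE_LEVEL quantile of |y − ŷ| / σ̂(x) over the out-of-sample training rows, each interval is [ŷ − q·σ̂(x), ŷ + q·σ̂(x)].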

In [21]:

# Starting project to predict absolute residuals
project_abs_residuals = dr.Project.start(
    sourcedata=df_train.drop(
        columns=[
            target_column,
            "partition_id",
            "prediction",
            "residuals",
        ],
        axis=1,
    ),
    project_name=f"Predicting absolute residuals from {project.project_name}",
    target="abs_residuals",
    worker_count=-1,
)
project_abs_residuals.wait_for_autopilot(check_interval=120)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 8, queued: 0 (waited: 5s)
In progress: 8, queued: 0 (waited: 9s)
In progress: 8, queued: 0 (waited: 16s)
In progress: 8, queued: 0 (waited: 29s)
In progress: 7, queued: 0 (waited: 55s)
In progress: 1, queued: 0 (waited: 108s)
In progress: 3, queued: 0 (waited: 211s)
In progress: 1, queued: 0 (waited: 331s)
In progress: 0, queued: 0 (waited: 452s)
In progress: 0, queued: 0 (waited: 572s)
In [22]:

# Get recommended model
best_model_abs_residuals = dr.ModelRecommendation.get(
    project_abs_residuals.id
).get_model()
best_model_abs_residuals
Out [22]:

Model('RandomForest Regressor')

In [23]:

# Compute training predictions and join
df_train = df_train.join(
    compute_training_predictions(model=best_model_abs_residuals)
    .rename(columns={"prediction": f"abs_residuals_prediction"})
    .set_index("row_id")
    .drop(columns=["partition_id"], axis=1)
)
In [24]:

# Now compute prediction on testing data and join
pred_dataset_abs_residuals = project_abs_residuals.upload_dataset(
    sourcedata=df_test, max_wait=60 * 60 * 24
)
df_test = df_test.join(
    best_model_abs_residuals.request_predictions(
        dataset_id=pred_dataset_abs_residuals.id
    )
    .get_result_when_complete(max_wait=60 * 60 * 24)
    .rename(columns={"prediction": f"abs_residuals_prediction"})
    .set_index("row_id")
)
In [25]:

# Now we need to compute our locally-weighted conformity score and take the quantile
scaled_abs_residuals = df_train["abs_residuals"] / df_train["abs_residuals_prediction"]
scaled_abs_residuals_q = scaled_abs_residuals.quantile(COVERAGE_LEVEL)
scaled_abs_residuals_q
Out [25]:

2.0517307447009157
In [26]:

# Using the conformity score and absolute residuals model, create the prediction intervals
df_test["method_3_lower"] = (
    df_test["prediction"] - df_test["abs_residuals_prediction"] * scaled_abs_residuals_q
)
df_test["method_3_upper"] = (
    df_test["prediction"] + df_test["abs_residuals_prediction"] * scaled_abs_residuals_q
)
In [27]:

# Compute coverage / width
method_3_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_3_lower"].values,
    y_upper=df_test["method_3_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Locally-Weighted Conformal",
)
Locally-Weighted Conformal: Percentage in the range (expecting 90.00): 89.800000
Locally-Weighted Conformal: Average width: 21.710734

8. Method 4: Conformalized Quantile Regression

If you consider “locally-weighted conformal” to be a model-based extension of “absolute conformal”, then you could consider “conformalized quantile regression” to be a model-based extension of “signed conformal.” The goal is similar – create more adaptive prediction intervals, but it inherits the quality that the prediction intervals are not forced to be symmetric. The reference paper offers a symmetric and asymmetric formulation for the conformity scores. The former (Theorem 1) “allows coverage errors to be spread arbitrarily over the left and right tails” while the latter (Theorem 2) controls “the left and right tails independently, resulting in a stronger coverage guarantee” at the cost of slightly wider prediction intervals. Here, you’ll use the symmetric version. The full method is as follows:

  1. Compute lower and upper quantile levels based on the specified COVERAGE_LEVEL
  2. Train two quantile regression models at the lower and upper quantile levels on the training set
  3. Compute the out-of-sample predictions for both quantile models
  4. Compute the conformity scores, E, such that E = max[q̂_lower − y, y − q̂_upper], where y is the target, q̂_lower is the lower quantile prediction, and q̂_upper is the upper quantile prediction
  5. Compute the quantile associated with the specified COVERAGE_LEVEL on E
  6. Add and subtract this quantile value to the quantile predictions
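In symbols, with Q denoting the COVERAGE_LEVEL quantile of E over the out-of-sample training rows, each interval is [q̂_lower(x) − Q, q̂_upper(x) + Q].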

Notably, this approach is completely independent of your main model. That is, it doesn’t use any information about the recommended model defined earlier. This may or may not be desired, depending on the user’s preference or use case requirements. To ensure the main model’s predictions fall within the prediction interval, you’ll simply extend the interval’s boundary to be equal to the prediction itself (if the prediction lies outside of the respective prediction interval). Additionally, when using non-linear quantile regression methods (e.g., tree-based approaches, neural networks), it’s possible to experience quantile crossing (i.e., non-monotonic quantile predictions). To combat this, the referenced paper offers a solution via rearrangement, which is implemented here.

There are two ways to run quantile regression models in DataRobot:

  1. Set the project metric to quantile loss (which is currently a public-preview feature)
  2. Use certain blueprints with algorithms that support quantile loss as a hyperparameter in your current project. These include gradient boosted trees from scikit-learn and Keras neural networks.

In this notebook, you’ll use the second approach, since it’s generally available. This involves using DataRobot’s advanced tuning functionality to change the loss function to the desired quantile loss.

In [28]:

# Get a GBT from scikit-learn (using the first one)
models = project.get_models()
gbt_models = [x for x in models if x.model_type.startswith("Gradient Boosted")]

# Check for GBT model. If none, make one.
if gbt_models:
    # Get most accurate one on validation set
    gbt_model = gbt_models[0]

else:
    # Pull models (will usually be at least one blueprint with a scikit-learn GBT)
    gbt_bps = [
        x
        for x in project.get_blueprints()
        if x.model_type.startswith("Gradient Boosted")
    ]

    # Get first one
    gbt_bp = gbt_bps[0]

    # Train it
    gbt_model = wait_for_async_model_creation(
        project_id=project.id,
        model_job_id=project.train(gbt_bp),
        max_wait=60 * 60 * 24,
    )

gbt_model
Out [28]:

Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Least-Squares Loss)')
In [29]:

# Train it on all the data
model_job_id = gbt_model.train(
    sample_pct=100,
    featurelist_id=gbt_model.featurelist_id,
)
gbt_model_100 = wait_for_async_model_creation(
    project_id=project.id, model_job_id=model_job_id, max_wait=60 * 60 * 24
)
gbt_model_100
Out [29]:

Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Least-Squares Loss)')

In [30]:

# Train quantile models
quantile_models = {lower_coverage_q: None, upper_coverage_q: None}

# Tune the GBT model to use quantile loss at each quantile level
for q in quantile_models.keys():
    # Start
    tune = gbt_model_100.start_advanced_tuning_session()

    # Set loss and level
    tune.set_parameter(
        task_name=gbt_model_100.model_type, parameter_name="loss", value="quantile"
    )
    tune.set_parameter(
        task_name=gbt_model_100.model_type, parameter_name="alpha", value=q
    )

    # Save job
    quantile_models[q] = tune.run()

# Wait and get resulting models
for q in quantile_models.keys():
    quantile_models[q] = quantile_models[q].get_result_when_complete(
        max_wait=60 * 60 * 24
    )

quantile_models
Out [30]:

{0.05: Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Quantile Loss)'),
 0.95: Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Quantile Loss)')}
In [31]:

# Compute training predictions
for q in quantile_models.keys():
    df_train = df_train.join(
        compute_training_predictions(model=quantile_models[q])
        .rename(columns={"prediction": f"quantile_prediction_{q}"})
        .set_index("row_id")
        .drop(columns=["partition_id"], axis=1)
    )

# Check
df_train[
    [
        target_column,
        f"quantile_prediction_{lower_coverage_q}",
        f"quantile_prediction_{upper_coverage_q}",
    ]
]

Out[31]:

      | game_score | quantile_prediction_0.05 | quantile_prediction_0.95
0     | 0.0        | -0.661701                | 7.367953
1     | 0.0        | -0.109510                | 16.625219
2     | 21.6       | 3.752373                 | 24.534147
3     | 4.4        | 0.521447                 | 18.209039
4     | 26.7       | 1.367213                 | 26.972765
...
7994  | 18.4       | 4.882632                 | 35.834840
7995  | 12.9       | 0.879488                 | 18.311118
7996  | 19.3       | 1.235300                 | 23.512004
7997  | 9.8        | 5.622114                 | 32.164047
7998  | 8.1        | -0.046493                | 19.430948
7999 rows × 3 columns
In [32]:

# Making prediction on test data
quantile_models_test_predict = quantile_models.copy()
for q in quantile_models.keys():
    quantile_models_test_predict[q] = quantile_models[q].request_predictions(
        dataset_id=pred_dataset.id
    )

# Joining the results
for q in quantile_models.keys():
    df_test = df_test.join(
        quantile_models_test_predict[q]
        .get_result_when_complete(max_wait=60 * 60 * 24)
        .rename(columns={"prediction": f"quantile_prediction_{q}"})
        .set_index("row_id")
    )

# Check
df_test[
    [
        target_column,
        f"quantile_prediction_{lower_coverage_q}",
        f"quantile_prediction_{upper_coverage_q}",
    ]
]

Out[32]:

      | game_score | quantile_prediction_0.05 | quantile_prediction_0.95
0     | 5.4        | 0.918277                 | 25.233671
1     | 16.5       | 1.488291                 | 25.160815
2     | 7.2        | 0.315117                 | 24.488930
3     | 23.8       | 0.864427                 | 23.131123
4     | 0.0        | 0.239838                 | 21.055298
...
1995  | 17.0       | -0.113189                | 23.151948
1996  | 25.0       | 5.386518                 | 23.425884
1997  | -1.2       | -1.631877                | 22.678644
1998  | 15.7       | 2.107615                 | 23.573162
1999  | 14.3       | 3.644082                 | 22.918123
2000 rows × 3 columns
In [33]:

# Implement quantile rearrangement
q_crossing_train = (
    df_train[f"quantile_prediction_{lower_coverage_q}"]
    > df_train[f"quantile_prediction_{upper_coverage_q}"]
).sum()
q_crossing_test = (
    df_test[f"quantile_prediction_{lower_coverage_q}"]
    > df_test[f"quantile_prediction_{upper_coverage_q}"]
)
print(
    f"Number of rows with quantile crossing in training set (before rearrangement): {q_crossing_train}"
)
print(
    f"Number of rows with quantile crossing in testing set (before rearrangement): {q_crossing_test}"
)

# Capture quantile columns
quantile_pred_cols = [x for x in df_train.columns if x.startswith("quantile")]

# On training set
df_train = df_train.drop(quantile_pred_cols, axis=1).join(
    quantile_rearrangement(
        test_preds=df_train[
            [
                f"quantile_prediction_{lower_coverage_q}",
                f"quantile_prediction_{upper_coverage_q}",
            ]
        ],
        quantile_low=lower_coverage_q,
        quantile_high=upper_coverage_q,
    )
)

# On testing set
df_test = df_test.drop(quantile_pred_cols, axis=1).join(
    quantile_rearrangement(
        test_preds=df_test[
            [
                f"quantile_prediction_{lower_coverage_q}",
                f"quantile_prediction_{upper_coverage_q}",
            ]
        ],
        quantile_low=lower_coverage_q,
        quantile_high=upper_coverage_q,
    )
)

# Check again
q_crossing_train = (
    df_train[f"quantile_prediction_{lower_coverage_q}"]
    > df_train[f"quantile_prediction_{upper_coverage_q}"]
).sum()
q_crossing_test = (
    df_test[f"quantile_prediction_{lower_coverage_q}"]
    > df_test[f"quantile_prediction_{upper_coverage_q}"]
)
print(
    f"Number of rows with quantile crossing in training set (after rearrangement): {q_crossing_train}"
)
print(
    f"Number of rows with quantile crossing in testing set (after rearrangement): {q_crossing_test}"
)
Number of rows with quantile crossing in training set (before rearrangement): 7
Number of rows with quantile crossing in testing set (before rearrangement): 0       False
1       False
2       False
3       False
4       False
        ...  
1995    False
1996    False
1997    False
1998    False
1999    False
Length: 2000, dtype: bool
Number of rows with quantile crossing in training set (after rearrangement): 0
Number of rows with quantile crossing in testing set (after rearrangement): 
0       False
1       False
2       False
3       False
4       False
        ...  
1995    False
1996    False
1997    False
1998    False
1999    False
Length: 2000, dtype: bool
In [34]:

# Now we compute our conformity score and take the quantile
E_cqr = np.maximum(
    df_train[f"quantile_prediction_{lower_coverage_q}"] - df_train[target_column],
    df_train[target_column] - df_train[f"quantile_prediction_{upper_coverage_q}"],
)
E_cqr_q = E_cqr.quantile(COVERAGE_LEVEL)
E_cqr_q
Out [34]:

1.7189887578306196
In [35]:

# Create the prediction intervals
df_test["method_4_lower"] = df_test[f"quantile_prediction_{lower_coverage_q}"] - E_cqr_q
df_test["method_4_upper"] = df_test[f"quantile_prediction_{upper_coverage_q}"] + E_cqr_q
In [36]:

# Extend to make sure the prediction is inside the interval
df_test["method_4_lower"] = df_test[["method_4_lower", "prediction"]].min(axis=1)
df_test["method_4_upper"] = df_test[["method_4_upper", "prediction"]].max(axis=1)
In [37]:

# Compute coverage / width
method_4_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_4_lower"].values,
    y_upper=df_test["method_4_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Conformalized Quantile Regression",
)
Conformalized Quantile Regression: Percentage in the range (expecting 90.00): 89.700000
Conformalized Quantile Regression: Average width: 21.706370

9. Comparing methods

Below you can see that the more advanced methods (i.e., “locally-weighted conformal” and “conformalized quantile regression”) yield similar coverage rates while producing smaller prediction intervals on average. Notably, this is just one dataset, and it’s suggested to empirically experiment with your own data to find the best method for your use case.

In [38]:

# Organize
summary = pd.DataFrame(
    {
        "Coverage": [
            method_1_coverage[0],
            method_2_coverage[0],
            method_3_coverage[0],
            method_4_coverage[0],
        ],
        "Average Width": [
            method_1_coverage[1],
            method_2_coverage[1],
            method_3_coverage[1],
            method_4_coverage[1],
        ],
        "Method": [
            "Absolute Conformal",
            "Signed Conformal",
            "Locally-Weighted Conformal",
            "Conformalized Quantile Regression",
        ],
    }
)
summary

Out[38]:

    Coverage | Average Width | Method
0   89.0     | 22.056863     | Absolute Conformal
1   88.9     | 22.191478     | Signed Conformal
2   89.8     | 21.710734     | Locally-Weighted Conformal
3   89.7     | 21.706370     | Conformalized Quantile Regression

10. Conclusion

This notebook demonstrates how one could build prediction intervals for any DataRobot model using methods derived from the conformal inference space. Conformal inference is a popular framework for generating such prediction intervals because it doesn't require strict distributional assumptions to achieve the desired coverage, so long as the testing data is exchangeable with the training data. This characteristic was confirmed in the analysis done here. Because each approach offers different pros and cons, it's worthwhile to use this AI Accelerator as a starting point for your own experiments to decide which one to implement for your use case. DataRobot offers the ability to easily implement each of these methods, even for the more advanced techniques. For more information on the topic of conformal inference, see the following introductory paper.

Reinforcement Learning in DataRobot
https://www.datarobot.com/ai-accelerators/reinforcement-learning-in-datarobot/
Tue, 27 Feb 2024

In this notebook, we implement a very simple model based on the Q-learning algorithm. This notebook is intended to show a basic form of RL that doesn't require a deep understanding of neural networks or advanced mathematics, and how one might deploy such a model in DataRobot.


This example shows the Grid World problem, where an agent learns to navigate a grid to reach a goal.

The notebook will go through the following steps:

  1. Define State and Action Space
  2. Create a Q-table to store expected rewards for each state/action combination
  3. Implement learning algorithm and train model
  4. Evaluate model
  5. Deploy to a DataRobot REST API endpoint

1. Define State and Action Space

Let’s first install datarobotx for some convenient DataRobot deployment procedures.

In [ ]:

%%bash
pip install -U datarobotx
In [ ]:

import random

import numpy as np
In [ ]:

# Grid settings
grid_size = 4

# function to build a list of all state tuples


def build_state_list(grid_size):
    state_list = []
    for i in range(grid_size):
        for j in range(grid_size):
            state_list.append((i, j))
    return state_list


all_states = build_state_list(grid_size)

# Here we just try to reach the top right corner (could be center or any other state)
goal_state = (3, 3)
n_states = grid_size * grid_size
n_actions = 4  # Up, Down, Left, Right

2. Create a Q-table to store expected rewards for each state/action combination

In [ ]:

# Initialize Q-table
Q = np.zeros((n_states, n_actions))

# Helper functions


def state_to_index(state):
    return state[0] * grid_size + state[1]


def index_to_state(index):
    return (index // grid_size, index % grid_size)


def get_possible_actions(state):
    actions = []
    if state[0] > 0:
        actions.append(0)  # Up
    if state[0] < grid_size - 1:
        actions.append(1)  # Down
    if state[1] > 0:
        actions.append(2)  # Left
    if state[1] < grid_size - 1:
        actions.append(3)  # Right
    return actions


# Correct the state transition function to prevent invalid states


def take_action(state, action):
    new_state = list(state)
    if action == 0 and state[0] > 0:
        new_state[0] -= 1  # Up
    if action == 1 and state[0] < grid_size - 1:
        new_state[0] += 1  # Down
    if action == 2 and state[1] > 0:
        new_state[1] -= 1  # Left
    if action == 3 and state[1] < grid_size - 1:
        new_state[1] += 1  # Right
    return tuple(new_state)

3. Implement learning algorithm and train model

In [ ]:

# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1  # Exploration rate
n_episodes = 100000

# Training the model with corrected state transitions
for episode in range(n_episodes):
    # start at a random state
    state = random.choice(all_states)
    done = state == goal_state

    while not done:
        state_index = state_to_index(state)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action
            action = random.choice(get_possible_actions(state))
        else:
            # Exploit: choose the best action from Q-table
            action = np.argmax(Q[state_index])

        # Take action and observe reward
        next_state = take_action(state, action)
        reward = 1 if next_state == goal_state else 0
        next_state_index = state_to_index(next_state)

        # Q-learning update
        Q[state_index, action] = Q[state_index, action] + learning_rate * (
            reward
            + discount_factor * np.max(Q[next_state_index])
            - Q[state_index, action]
        )

        # Transition to the next state
        state = next_state
        done = state == goal_state

4. Evaluate model

First, we will show one path, then see on average how many actions it takes to reach the goal state.

In [ ]:

# Evaluating the model
state = random.choice(all_states)
print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])  # Choose the best action
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (3, 3)
[(3, 3)]
In [ ]:

total_actions = 0  # Total number of actions taken to reach the goal
for state in all_states:
    # Evaluating the model
    trajectory = [state]
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        action = np.argmax(Q[state_index])  # Choose the best action
        state = take_action(state, action)
        trajectory.append(state)
        done = state == goal_state
        total_actions += 1
print(
    "Average number of actions taken to reach the goal:",
    total_actions / len(all_states),
)
Out [ ]:

Average number of actions taken to reach the goal: 3.0
Is this optimal? We know the optimal policy is to move up or to the right until we reach the goal. From the bottom-left corner this takes 6 actions, for the next 2 states it takes 5 actions, for the next 3 states it takes 4 actions, then 4 states take 3 actions, 3 states take 2, 2 states take 1, and 1 state (the goal itself) takes 0. By simple arithmetic we have

1*6 + 2*5 + 3*4 + 4*3 + 3*2 + 2*1 + 1*0 = 48

Total states = 16

Therefore, the optimal average is 48/16 = 3, which is exactly our average number of actions.
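As a quick sanity check on this arithmetic (an illustration, not part of the original notebook), the same value can be computed directly as the average Manhattan distance to the goal over all 16 states:

# Average Manhattan distance from every cell of the 4x4 grid to the goal state (3, 3)
optimal_avg = sum(abs(3 - i) + abs(3 - j) for i in range(4) for j in range(4)) / 16
print(optimal_avg)  # prints 3.0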

5. Deploy to a DataRobot REST API endpoint

In [ ]:

import pickle

import datarobot as dr
import numpy as np
import pandas as pd
In [ ]:

import os

os.makedirs("./storage/deploy/", exist_ok=True)
# save the Q table to a pickle file
with open("./storage/deploy/q_table.pkl", "wb") as f:
    pickle.dump(Q, f)

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

In [ ]:

dr_client = dr.Client()

Define hooks for deploying an unstructured custom model. One could use a standard custom model deployment, but the unstructured variant is used here to illustrate the flexibility available for more complex RL problems.

In [ ]:

def load_model(input_dir):
    """Custom model hook for loading our Q-table

    Make sure to execute the cell earlier in the notebook that create Q-table before deploying
    """

    with open(input_dir + "/storage/deploy/" + "q_table.pkl", "rb") as f:
        Q = pickle.load(f)

    return Q


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for return action.

    model: The output of load_model is passed to this object
    data: str
        Expects json string passed in request body.
        Required keys:
                state: tuple(int, int) .. Current state of the agent
    query: None
        Unused
    **kwargs: dict
        Unused

    Returns:
        JSON string with output action

    """
    import json

    import numpy as np

    Q = model
    grid_size = int(np.sqrt(len(Q)))  # Grid size is inferred from the Q-table

    # Helper functions
    def state_to_index(state):
        return state[0] * grid_size + state[1]
    
    data_dict = json.loads(data)
    state = data_dict["state"]

    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])

    return json.dumps({"action": action}, default=int)

Test out the prediction structure prior to deployment.

In [ ]:

import json

score_unstructured(
    load_model("."),
    json.dumps({"state": (0, 1)}),
    None,
)
Out [ ]:

'{"action": 1}'

Deploy the RL policy model. We will use this convenience method in drx.

  • Builds a new Custom Model Environment
  • You can also use a DataRobot Python Drop-in Environment (e.g. "6386dc1159c606b0d8beddc7")
  • Assembles a new Custom Model with the provided hooks
  • Deploys an Unstructured Custom Model to your Deployments
  • Returns an object which can be used to make predictions

Use environment_id to re-use an existing Custom Model Environment that you’re happy with for shorter iteration cycles on the custom model hooks.

Note: See https://app.datarobot.com/docs/api/api-quickstart/index.html for instructions to set up a drconfig.yaml or call drx.Context() to initialize your credentials.

In [ ]:

import datarobotx as drx

drx.Context().endpoint = dr_client.endpoint
drx.Context().token = dr_client.token
In [ ]:

deployment = drx.deploy(
    "storage/deploy/",
    hooks={"score_unstructured": score_unstructured, "load_model": load_model},
    extra_requirements=[],
    # environment_id="6386dc1159c606b0d8beddc7",
)
Out [ ]:

# Deploying custom model
  - Unable to auto-detect model type; any provided paths and files will be
    exported - dependencies should be explicitly specified using
    `extra_requirements` or `environment_id`
  - Preparing model and environment...
  - Configured environment [[Custom]
    priceless-ganguly](https://app.datarobot.com/model-registry/custom-environments/65ac4115be769b7f85d5aaf9)
    with requirements:
      python 3.9.16
      datarobot-drum==1.10.14
      datarobot-mlops==9.2.8
      cloudpickle==2.2.1
  - Awaiting custom environment build...
Out [ ]:

  - Configuring and uploading custom model...

    100%|███████████████████████████| 11.0k/11.0k [00:00<00:00, 5.14MB/s]
  - Registered custom model
    [priceless-ganguly](https://app.datarobot.com/model-registry/custom-models/65ac42ce046ed058aada50c7/info)
    with target type: Unstructured
  - Creating and deploying model package...
Out [ ]:

  - Created deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Custom model deployment complete

Let's try out our deployment and track the trajectory from the deployed policy (which returns an action).

In [ ]:

# If your deployment already occurred or your notebook restarted due to inactivity, get the ID from the URL in the UI
# deployment = drx.Deployment("YOUR DEPLOYEMENT ID HERE")
deployment.predict_unstructured({"state": (0, 1)})
Out [ ]:

# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
{'action': 1}

Test and print trajectory.

In [ ]:

state = (0, 1)
goal_state = (3, 3)

print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    action = deployment.predict_unstructured({"state": state})["action"]
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (0, 1)
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
[(0, 1), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
Dimensionality Reduction in DataRobot Using t-SNE
https://www.datarobot.com/ai-accelerators/dimensionality-reduction-in-datarobot-using-t-sne/
Tue, 27 Feb 2024

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful technique for dimensionality reduction that can effectively visualize high-dimensional data in a lower-dimensional space.

Dimensionality reduction can improve machine learning results by reducing the computational complexity of the algorithms, preventing overfitting, and focusing on the most relevant features in the dataset. Note that this technique should only be used when the number of features is low.

Import libraries

In [ ]:
import datarobot as dr
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE

Connect to DataRobot

Instructions for obtaining your endpoint and token are located in the DataRobot API documentation here.

In [3]:

# either directly pass in your endpoint/token, use a config file, or connect using DataRobot notebooks
dr.Client()
Out [3]:

<datarobot.rest.RESTClientObject at 0x7f5f10312280>

Get dataset

This example uses data on the movement of a double pendulum. The dataset has already been loaded into DataRobot for this example, but it can also be found here.

In [40]:

# replace the dataset ID with your own data
ds_id = "62fbcdf583b30f0ef972dc31"

# get dataset from DataRobot
ds = dr.Dataset.get(ds_id)
df = ds.get_as_dataframe()
display(df)

Out[40]:

      | t         | x1     | x2     | v1      | v2       | a1    | a2
0     | 0.000000  | 2.36   | 3.14   | -0.0100 | -0.01000 | -9.24 | 6.53
1     | 0.000862  | 2.36   | 3.14   | -0.0180 | -0.00437 | -9.24 | 6.53
2     | 0.001720  | 2.36   | 3.14   | -0.0259 | 0.00126  | -9.24 | 6.53
3     | 0.002590  | 2.36   | 3.14   | -0.0339 | 0.00689  | -9.24 | 6.53
4     | 0.003450  | 2.36   | 3.14   | -0.0418 | 0.01250  | -9.24 | 6.53
...
2424  | 9.970000  | -14.70 | -22.40 | 1.1400  | 1.82000  | 6.94  | -3.84
2425  | 9.980000  | -14.70 | -22.30 | 1.2000  | 1.79000  | 7.04  | -3.64
2426  | 9.980000  | -14.70 | -22.30 | 1.2500  | 1.76000  | 7.12  | -3.42
2427  | 9.990000  | -14.70 | -22.30 | 1.3100  | 1.73000  | 7.20  | -3.19
2428  | 10.000000 | -14.70 | -22.30 | 1.3700  | 1.70000  | 7.28  | -2.95
2429 rows × 7 columns

Reduce the number of features in the dataset

In [ ]:

# features to exclude from reduction
# can be target columns or ID columns or other
exclude_cols = ["t", "a2"]

model = TSNE(learning_rate=100, random_state=42)
transformed = model.fit_transform(df.drop(exclude_cols, axis=1))
In [25]:

transformed
Out [25]:

array([[  2.542573 , -80.301025 ],
       [  2.5057044, -80.29103  ],
       [  2.869162 , -80.113396 ],
       ...,
       [  9.5524645,  74.92201  ],
       [  9.630235 ,  74.90384  ],
       [  9.827253 ,  74.67084  ]], dtype=float32)

Create new dataframe with reduced columns and previously excluded columns

In [39]:

# get the tsne dataset
reduced_df = pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"])

# join in target and time columns from original dataset
reduced_df = pd.concat([reduced_df, df[exclude_cols]], axis=1)

display(reduced_df)

Out[39]:

      | tsne_x    | tsne_y     | t         | a2
0     | 2.542573  | -80.301025 | 0.000000  | 6.53
1     | 2.505704  | -80.291031 | 0.000862  | 6.53
2     | 2.869162  | -80.113396 | 0.001720  | 6.53
3     | 2.899721  | -80.068108 | 0.002590  | 6.53
4     | 2.924986  | -80.020332 | 0.003450  | 6.53
...
2424  | 9.658271  | 74.433037  | 9.970000  | -3.84
2425  | 9.417135  | 74.999992  | 9.980000  | -3.64
2426  | 9.552464  | 74.922012  | 9.980000  | -3.42
2427  | 9.630235  | 74.903839  | 9.990000  | -3.19
2428  | 9.827253  | 74.670837  | 10.000000 | -2.95
2429 rows × 4 columns

Upload back to DataRobot

In [42]:


ds = dr.Dataset.create_from_in_memory_data(
    data_frame=reduced_df, fname=f"{ds.name}.csv"
)
ds.modify(name=f"{ds.name} t-SNE Reduced")
ds
     
Out [42]:

Dataset(name='Double Pendulum.csv.csv t-SNE Reduced', id='65a970bc040d9a438cdfb9de')
     
MLFlow + DataRobot API for Tracking Experimentation
https://www.datarobot.com/ai-accelerators/mlflow-datarobot-api-for-tracking-experimentation/
Tue, 27 Feb 2024

As illustrated below, you will use the orchestration notebook to design and run the experiment notebook, with permutations of parameters handled automatically. At the end of the experiments, copies of the experiment notebook will be available, with the outputs for each permutation, for collaboration and reference.

Experimentation is a mandatory part of any machine learning developer's day-to-day work. For time series projects, the parameters and settings to tune in pursuit of the best model form a vast search space on their own.

About this Accelerator

Many of the experiments in time series use cases are common and repeatable. Tracking these experiments and logging their results is a task that needs streamlining. Manual errors and time limitations can lead to the selection of suboptimal models, leaving better candidates undiscovered.

Integrating the DataRobot API, Papermill, and MLFlow automates machine learning experimentation so that it becomes easier, more robust, and easier to share.

[Diagram: the orchestration notebook drives the experiment notebook, with parameter permutations handled automatically]

Run the mlflow ui command in the same directory to get the dashboard.

1. Use MLFlow with the DataRobot API for Experimentation and Logging

This notebook provides a framework that showcases the integration of MLFlow and Papermill to track machine learning experiments with DataRobot.

This framework outlines how to:

  • Use MLFlow with the DataRobot API to track and log ML experiments
    • Benefit: Consistent comparison of results across experiments
  • Use Papermill with the DataRobot API to create artifacts from machine learning experiments to reduce effort needed for collaboration
    • Benefit: Automation of experiments to avoid errors and reduce manual effort
  • Execute jupyter notebooks with parameters like Python scripts
  • Loop through parameter combinations to run multiple projects; build a Model Factory.

This notebook is the experimentation notebook for running individual time series experiments. Papermill is used to receive parameters from the main notebook (orchestration_notebook.ipynb) and run a copy of this notebook for each combination of the parameters.

The experiment notebook doesn't require any updates, as the parameters are passed in from the main notebook. However, it would need to be updated for different modeling approaches, such as AutoML or unsupervised learning.
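For illustration, here is a minimal sketch of what the orchestration loop might look like; the notebook file names, experiment name, and parameter grid are hypothetical examples rather than the accelerator's actual orchestration_notebook.ipynb.

import itertools
import uuid

import mlflow
import papermill as pm

# Hypothetical parameter grid to sweep over
fdw_values = [35, 70]       # feature derivation windows to try
kia_values = [True, False]  # known-in-advance feature settings to try

mlflow.set_experiment("ts_experiments")

for fdw, kia in itertools.product(fdw_values, kia_values):
    run_uuid = str(uuid.uuid1())
    with mlflow.start_run(run_name=f"fdw_{fdw}_kia_{kia}"):
        # Log the permutation so runs can be compared in the MLFlow UI
        mlflow.log_params({"FDW": fdw, "KIA": kia, "UUID": run_uuid})

        # Execute a parameterized copy of the experiment notebook with Papermill;
        # the executed copy is kept as an artifact for collaboration and reference
        pm.execute_notebook(
            "experiment_notebook.ipynb",
            f"./experiments_bkup/experiment_{run_uuid}.ipynb",
            parameters={"FDW": fdw, "KIA": kia, "UUID": run_uuid},
        )

Metrics from each run can then be logged with mlflow.log_metric and compared side by side in the dashboard opened by the mlflow ui command mentioned above.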

Setup

Bind inputs

In [28]:

FDW = 35
KIA = False
UUID = str("bcf6c090-1899-11ed-a7a1-f018981f05a4")
ACC_OPT = False
SRCH_INT = False
SEGMENTED = False
MODE = "quick"
TRAINING_DATA = "./DR_Demo_Sales_Multiseries_training (1).xlsx"
DATE_COL = "Date"
TRAINING_STOP_DATE = "01-06-2014"
TRAINING_STOP_DATE_FORMAT = "%d-%m-%Y"
DR_AUTH_YAML_FILE = "~/.config/datarobot/drconfig.yaml"
TARGET_COL = "Sales"
KIA_COLS = ["Marketing", "Near_Xmas", "Near_BlackFriday", "Holiday", "DestinationEvent"]
IS_MULTISERIES = True
MULTISERIES_COLS = ["Store"]
REFERENCE_NOTEBOOK = (
    "./experiments_bkup/experiment_d666bc12-7602-11ed-99f4-f018981f05a4.ipynb"
)

Import libraries

In [8]:

import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
from permetrics.regression import (  # permetrics library for simplifying metric calculation
    RegressionMetric,
)

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

# Authenticate in to your DataRobot instance
import datarobot as dr
import yaml

cred_file = open(DR_AUTH_YAML_FILE, "r")
credentials = yaml.safe_load(cred_file)

DATAROBOT_API_TOKEN = credentials["token"]
DATAROBOT_ENDPOINT = credentials["endpoint"]

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix="AIA-AE-MLF-1",  # Optional but helps DataRobot improve this workflow
)

dr.client._global_client = client

Import training data

In [3]:

df = pd.DataFrame()
if TRAINING_DATA.find(".csv") != -1:
    df = pd.read_csv(TRAINING_DATA, parse_dates=[DATE_COL])
elif TRAINING_DATA.find(".xls") != -1:
    df = pd.read_excel(TRAINING_DATA, parse_dates=[DATE_COL])
else:
    df = pd.DataFrame()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7140 entries, 0 to 7139
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Store             7140 non-null   object        
 1   Date              7140 non-null   datetime64[ns]
 2   Sales             7140 non-null   int64         
 3   Store_Size        7140 non-null   int64         
 4   Num_Employees     7140 non-null   int64         
 5   Returns_Pct       7140 non-null   float64       
 6   Num_Customers     7140 non-null   int64         
 7   Pct_On_Sale       7130 non-null   float64       
 8   Marketing         7140 non-null   object        
 9   Near_Xmas         7140 non-null   int64         
 10  Near_BlackFriday  7140 non-null   int64         
 11  Holiday           7140 non-null   object        
 12  DestinationEvent  7140 non-null   object        
 13  Pct_Promotional   7140 non-null   float64       
 14  Econ_ChangeGDP    80 non-null     float64       
 15  EconJobsChange    1020 non-null   float64       
 16  AnnualizedCPI     240 non-null    float64       
dtypes: datetime64[ns](1), float64(6), int64(6), object(4)
memory usage: 948.4+ KB

Private holdout

Set a cutoff date for private holdout. This is necessary to enable the same holdout for all experiments irrespective of feature derivation windows and forecast windows.

In [5]:

training_stop_date = pd.to_datetime(
    TRAINING_STOP_DATE, format=TRAINING_STOP_DATE_FORMAT
)
In [6]:

df_train = df[df[DATE_COL] < training_stop_date]
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7000 entries, 0 to 7125
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Store             7000 non-null   object        
 1   Date              7000 non-null   datetime64[ns]
 2   Sales             7000 non-null   int64         
 3   Store_Size        7000 non-null   int64         
 4   Num_Employees     7000 non-null   int64         
 5   Returns_Pct       7000 non-null   float64       
 6   Num_Customers     7000 non-null   int64         
 7   Pct_On_Sale       6990 non-null   float64       
 8   Marketing         7000 non-null   object        
 9   Near_Xmas         7000 non-null   int64         
 10  Near_BlackFriday  7000 non-null   int64         
 11  Holiday           7000 non-null   object        
 12  DestinationEvent  7000 non-null   object        
 13  Pct_Promotional   7000 non-null   float64       
 14  Econ_ChangeGDP    80 non-null     float64       
 15  EconJobsChange    1000 non-null   float64       
 16  AnnualizedCPI     230 non-null    float64       
dtypes: datetime64[ns](1), float64(6), int64(6), object(4)
memory usage: 984.4+ KB

Modeling

Create a DataRobot project

In [8]:

# Upload data and create a new DataRobot project
project = dr.Project.create(df_train, project_name="Repex_" + UUID)
project
Out [8]:

Project(Repex_bcf6c090-1899-11ed-a7a1-f018981f05a4)

Configure project settings

Set up time series settings for the newly created project.

In [9]:

known_in_advance = KIA_COLS
feature_settings = [
    dr.FeatureSettings(feat_name, known_in_advance=True)
    for feat_name in known_in_advance
]

time_partition = dr.DatetimePartitioningSpecification(
    datetime_partition_column=DATE_COL,
    use_time_series=True,
    feature_derivation_window_start=-1 * FDW,
    feature_derivation_window_end=0,
    forecast_window_start=1,
    forecast_window_end=14,
)

if KIA:
    time_partition.feature_settings = feature_settings

if IS_MULTISERIES:
    time_partition.multiseries_id_columns = MULTISERIES_COLS

advanced_options = dr.AdvancedOptions(
    accuracy_optimized_mb=ACC_OPT, autopilot_with_feature_discovery=SRCH_INT
)

Initiate Autopilot

After creating settings objects, Autopilot is started using the analyze_and_model function.

In [10]:
project.analyze_and_model(
    target=TARGET_COL,
    partitioning_method=time_partition,
    max_wait=3600,
    worker_count=-1,
    advanced_options=advanced_options,
    mode=MODE,
)
print(project.get_uri())
project.wait_for_autopilot()
https://app.datarobot.com/projects/63902ec8c32fb2b2077f5da1/models
In progress: 19, queued: 2 (waited: 0s)
In progress: 19, queued: 2 (waited: 1s)
In progress: 19, queued: 2 (waited: 3s)
In progress: 19, queued: 2 (waited: 5s)
In progress: 19, queued: 2 (waited: 7s)
In progress: 19, queued: 2 (waited: 10s)
In progress: 19, queued: 2 (waited: 14s)
In progress: 19, queued: 2 (waited: 21s)
In progress: 19, queued: 2 (waited: 35s)
In progress: 19, queued: 2 (waited: 56s)
In progress: 19, queued: 2 (waited: 78s)
In progress: 19, queued: 0 (waited: 99s)
In progress: 12, queued: 0 (waited: 120s)
In progress: 9, queued: 0 (waited: 141s)
In progress: 6, queued: 0 (waited: 162s)
In progress: 4, queued: 0 (waited: 184s)
In progress: 2, queued: 0 (waited: 205s)
In progress: 1, queued: 0 (waited: 226s)
In progress: 0, queued: 0 (waited: 247s)
In progress: 4, queued: 0 (waited: 268s)
In progress: 4, queued: 0 (waited: 289s)
In progress: 2, queued: 0 (waited: 310s)
In progress: 2, queued: 0 (waited: 331s)
In progress: 0, queued: 0 (waited: 352s)
In progress: 0, queued: 0 (waited: 373s)
In progress: 0, queued: 0 (waited: 395s)
In progress: 1, queued: 0 (waited: 416s)
In progress: 1, queued: 0 (waited: 437s)
In progress: 1, queued: 0 (waited: 458s)
In progress: 1, queued: 0 (waited: 479s)
In progress: 1, queued: 0 (waited: 500s)
In progress: 1, queued: 0 (waited: 521s)
In progress: 1, queued: 0 (waited: 542s)
In progress: 0, queued: 0 (waited: 564s)
In progress: 1, queued: 0 (waited: 585s)
In progress: 1, queued: 0 (waited: 606s)
In progress: 1, queued: 0 (waited: 627s)
In progress: 1, queued: 0 (waited: 648s)
In progress: 1, queued: 0 (waited: 669s)
In progress: 1, queued: 0 (waited: 690s)
In progress: 1, queued: 0 (waited: 711s)
In progress: 1, queued: 0 (waited: 732s)
In progress: 0, queued: 0 (waited: 754s)
In progress: 0, queued: 0 (waited: 775s)
In progress: 0, queued: 0 (waited: 796s)
In progress: 0, queued: 0 (waited: 817s)

After Autopilot completes, get the recommended model from DataRobot

In [11]:

recommendation = dr.ModelRecommendation.get(project.id)
recommended_model = recommendation.get_model()
print(recommended_model)

DatetimeModel('eXtreme Gradient Boosted Trees Regressor with Early Stopping (learning rate =0.3)')

Performance validation

Create the private holdout from the original dataset and get predictions from the DataRobot recommended model. Once predictions are available, they are compared to the actuals using regression metrics.

In [12]:

dataset = project.upload_dataset(df, forecast_point=training_stop_date)
pred_job = recommended_model.request_predictions(dataset_id=dataset.id)
preds = pred_job.get_result_when_complete()
In [14]:

preds["timestamp"] = pd.to_datetime(preds["timestamp"], utc=True)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], utc=True)
preds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   row_id             130 non-null    int64              
 1   prediction         130 non-null    float64            
 2   forecast_distance  130 non-null    int64              
 3   forecast_point     130 non-null    object             
 4   timestamp          130 non-null    datetime64[ns, UTC]
 5   series_id          130 non-null    object             
dtypes: datetime64[ns, UTC](1), float64(1), int64(2), object(2)
memory usage: 6.2+ KB
In [15]:

if IS_MULTISERIES:
    df_comparison = df[MULTISERIES_COLS + [DATE_COL, TARGET_COL]].merge(
        preds[["prediction", "timestamp", "series_id"]],
        left_on=MULTISERIES_COLS + [DATE_COL],
        right_on=["series_id", "timestamp"],
    )
else:
    df_comparison = df[[DATE_COL, TARGET_COL]].merge(
        preds[["prediction", "timestamp"]], left_on=[DATE_COL], right_on=["timestamp"]
    )
assert df_comparison.shape[0] == preds.shape[0]
df_comparison.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 130 entries, 0 to 129
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   Store       130 non-null    object             
 1   Date        130 non-null    datetime64[ns, UTC]
 2   Sales       130 non-null    int64              
 3   prediction  130 non-null    float64            
 4   timestamp   130 non-null    datetime64[ns, UTC]
 5   series_id   130 non-null    object             
dtypes: datetime64[ns, UTC](2), float64(1), int64(1), object(2)
memory usage: 7.1+ KB
Plotting actuals vs predicted for visual verification
In [24]:

if not IS_MULTISERIES:
    plt.plot(
        df_comparison["timestamp"],
        df_comparison["prediction"],
        label="Prediction",
        color="red",
    )
    plt.plot(
        df_comparison["timestamp"],
        df_comparison["Sales"],
        label="Actuals",
        color="blue",
        alpha=0.5,
    )
else:
    df_viz = df_comparison[
        df_comparison["series_id"] == df_comparison.series_id.unique()[0]
    ]
    plt.plot(df_viz["timestamp"], df_viz["prediction"], label="Prediction", color="red")
    plt.plot(
        df_viz["timestamp"], df_viz["Sales"], label="Actuals", color="blue", alpha=0.5
    )
plt.xticks(rotation=90)
Out [24]:

(array([16224., 16226., 16228., 16230., 16232., 16234.]),
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])
[Plot: predicted vs. actual Sales over the forecast window]
In [17]:

# Validate experiment performance
evaluator = RegressionMetric(
    df_comparison[TARGET_COL].values, df_comparison["prediction"].values
)

Tracking and logging experiments

Log experiment metrics and parameters for display and comparison on the MLFlow UI.

In [30]:

with mlflow.start_run():
    mlflow.log_param("Project URL", project.get_uri())  # URL for DataRobot Project
    mlflow.log_param(
        "Notebook Location", REFERENCE_NOTEBOOK
    )  # location of final notebook for reference
    mlflow.log_param("Feature Derivation Window", FDW)  # feature derivation used
    mlflow.log_param(
        "Enabled Known In Advance features", KIA
    )  # known in advance setting
    mlflow.log_param(
        "Ran Accuracy Optimized BPs", ACC_OPT
    )  # accuracy optimized setting
    mlflow.log_param(
        "Enabled Search Interactions option", SRCH_INT
    )  # search for interactions setting
    mlflow.log_param("Autopilot Mode", MODE)  # autopilot mode

    mlflow.log_artifact(REFERENCE_NOTEBOOK)  # location of final notebook for reference

    # logging model performance metrics
    mlflow.log_metric("MASE", evaluator.MASE())
    mlflow.log_metric("MAPE", evaluator.MAPE())
    mlflow.log_metric("RMSE", evaluator.RMSE())
    mlflow.log_metric("MAE", evaluator.MAE())
    mlflow.log_metric("R2", evaluator.R2())
    mlflow.log_metric("Support", preds.shape[0])

2. Integrate MLFlow and Papermill to Track ML Experiments with DataRobot

This notebook outlines how to:

  • Use MLFlow with DataRobot API to track and log machine learning experiments
    • Benefit: Consistent comparison of results across experiments
  • Use Papermill with DataRobot API to create artifacts from machine learning experiments to reduce effort needed for collaboration
    • Benefit: Automation of experiments to avoid errors and reduce manual effort.
  • Execute Jupyter notebooks with parameters like Python scripts
  • Loop through parameter combinations to run multiple projects; build a Model Factory.

This orchestration notebook illustrates the framework to integrate MLFlow and Papermill with the DataRobot API to run the experiment notebook with different parameters per experiment.

This notebook runs experiment_notebook.ipynb with different parameters for each experiment.

Required Python libraries: datarobot, papermill, mlflow, permetrics, pandas, numpy, matplotlib, and pyyaml.

Setup

Import libraries

uuid is used to generate a unique identifier for each experiment. itertools is used to generate all combinations of the experiment parameters.

In [1]:

import itertools
import os
import uuid

import papermill as pm

Use the snippet below to create requisite folders.

In [2]:

if not os.path.isdir("./experiments_bkup"):
    os.mkdir("./experiments_bkup")

Configure use case settings

These are the basic settings needed to run Time Series projects through the DataRobot API. These settings have to be updated for the intended use case.

In [3]:

DR_AUTH_YAML_FILE = (
    "~/.config/datarobot/drconfig.yaml"  # yaml file with authentication details
)
TRAINING_DATA = (
    "./DR_Demo_Sales_Multiseries_training (1).xlsx"  # location of training dataset
)
DATE_COL = "Date"  # datetime column
TRAINING_STOP_DATE = "01-06-2014"  # cutoff date for private holdout for experiments
TRAINING_STOP_DATE_FORMAT = (
    "%d-%m-%Y"  # datetime format specifier for TRAINING_STOP_DATE
)
TARGET_COL = "Sales"  # target column for the usecase
KIA_COLS = [
    "Marketing",
    "Near_Xmas",
    "Near_BlackFriday",
    "Holiday",
    "DestinationEvent",
]  # known in advance features
IS_MULTISERIES = True  # does the dataset have multiple time series
MULTISERIES_COLS = [
    "Store"
]  # if the dataset has multiple ts, columns that uniquely identify a ts.

Scenario

There are many experiments to try in Time Series projects. The most basic ones include experimenting with multiple feature derivation windows and enabling known in advance features. These two parameters alone can result in at least six different experiments, as shown below.
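
For illustration only, a small sketch (not part of the original workflow) that enumerates the grid these two parameters produce, using the same values configured in the next cell:

import itertools

fdws = [35, 70, 14]   # feature derivation windows to try
kias = [False, True]  # known in advance features on/off

for fdw, kia in itertools.product(fdws, kias):
    print(f"FDW={fdw}, KIA={kia}")
# 3 x 2 = 6 experiment configurations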

First experiment series set

This example starts with a basic set of experiments to quickly identify whether the dataset has any signal. You will use a combination of feature derivation windows and known in advance features to do so.

In [4]:
fdws = [
    35,
    70,
    14,
]  # The Time Series feature derivation window parameter values to experiment
kias = [False, True]  # The known in advance parameter values to experiment with

Run multiple projects for all combinations of the values from the two parameter sets above. This can be seen as a "DataRobot Project Factory," where you run multiple projects using Papermill. Papermill allows you to send parameters to a Jupyter notebook and execute it with those parameters. It also saves a copy of each executed notebook in a specified folder.

In [5]:

INPUT_PATH = "./experiment_notebook.ipynb"
for item in itertools.product(fdws, kias):
    UUID = str(uuid.uuid1())
    OUTPUT_PATH = "./experiments_bkup/experiment_{}.ipynb".format(UUID)
    pm.execute_notebook(
        input_path=INPUT_PATH,
        output_path=OUTPUT_PATH,
        parameters={
            "FDW": item[0],
            "KIA": item[1],
            "UUID": UUID,
            "DR_AUTH_YAML_FILE": DR_AUTH_YAML_FILE,
            "TRAINING_DATA": TRAINING_DATA,
            "DATE_COL": DATE_COL,
            "TRAINING_STOP_DATE": TRAINING_STOP_DATE,
            "TRAINING_STOP_DATE_FORMAT": TRAINING_STOP_DATE_FORMAT,
            "TARGET_COL": TARGET_COL,
            "KIA_COLS": KIA_COLS,
            "IS_MULTISERIES": IS_MULTISERIES,
            "MULTISERIES_COLS": MULTISERIES_COLS,
            "REFERENCE_NOTEBOOK": OUTPUT_PATH,
        },
    )
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]

Experiment results

After the above set of experiments completes, you can open the MLFlow dashboard to review the results. Run the cell below (or its contents from the command line) to start the MLFlow server and UI.

In [6]:

# Ensure to stop the execution of this cell before running next cells
!mlflow ui
[2022-12-07 15:47:24 +0530] [20341] [INFO] Starting gunicorn 20.1.0
[2022-12-07 15:47:24 +0530] [20341] [INFO] Listening at: http://127.0.0.1:5000 (20341)
[2022-12-07 15:47:24 +0530] [20341] [INFO] Using worker: sync
[2022-12-07 15:47:24 +0530] [20345] [INFO] Booting worker with pid: 20345
^C
[2022-12-07 15:58:55 +0530] [20341] [INFO] Handling signal: int
[2022-12-07 15:58:55 +0530] [20345] [INFO] Worker exiting (pid: 20345)
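
As an alternative to the UI, you can also pull the logged runs into a pandas DataFrame for a quick side-by-side comparison. A minimal sketch, assuming the default local ./mlruns tracking store used by the cells above:

import mlflow

runs = mlflow.search_runs()  # one row per logged run in the active (default) experiment
# Keep only the parameter and metric columns logged by the experiment notebook
cols = [c for c in runs.columns if c.startswith(("params.", "metrics."))]
print(runs[cols].head())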

Further experimentation

Once you are comfortable with the initial set of experiments and results, you can expand the experiment combinations as shown below. The advantage of parameterizing the notebook is that you run only the experiments that are needed, and you can keep building on the experiments you have already run.

For example, accuracy-optimized blueprints default to False; since that setting was already covered by the experiments in the prior cells, you can save time and compute by running only the True option for this parameter in subsequent experiments.

In [7]:

# Import datarobot library for the enums
import datarobot as dr
In [8]:

fdws = [35, 14]  # TS feature derivation window parameter values to experiment
kias = [False]  # Known in advance parameter values to experiment
acc_opt = [True]  # Enable accuracy optimized blueprints
search_int = [True]  # Search for interactions between features
mode = [dr.enums.AUTOPILOT_MODE.FULL_AUTO]  # Autopilot mode values to experiment
In [9]:

INPUT_PATH = "./experiment_notebook.ipynb"
for item in itertools.product(*[fdws, kias, acc_opt, search_int, mode]):
    UUID = str(uuid.uuid1())
    OUTPUT_PATH = "./experiments_bkup/experiment_{}.ipynb".format(UUID)
    pm.execute_notebook(
        input_path=INPUT_PATH,
        output_path=OUTPUT_PATH,
        parameters={
            "FDW": item[0],
            "KIA": item[1],
            "ACC_OPT": item[2],
            "SRCH_INT": item[3],
            "MODE": item[4],
            "UUID": UUID,
            "DR_AUTH_YAML_FILE": DR_AUTH_YAML_FILE,
            "TRAINING_DATA": TRAINING_DATA,
            "DATE_COL": DATE_COL,
            "TRAINING_STOP_DATE": TRAINING_STOP_DATE,
            "TRAINING_STOP_DATE_FORMAT": TRAINING_STOP_DATE_FORMAT,
            "TARGET_COL": TARGET_COL,
            "KIA_COLS": KIA_COLS,
            "IS_MULTISERIES": IS_MULTISERIES,
            "MULTISERIES_COLS": MULTISERIES_COLS,
            "REFERENCE_NOTEBOOK": OUTPUT_PATH,
        },
    )
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
In [12]:

!mlflow ui
[2022-12-07 17:04:35 +0530] [45452] [INFO] Starting gunicorn 20.1.0
[2022-12-07 17:04:35 +0530] [45452] [INFO] Listening at: http://127.0.0.1:5000 (45452)
[2022-12-07 17:04:35 +0530] [45452] [INFO] Using worker: sync
[2022-12-07 17:04:35 +0530] [45457] [INFO] Booting worker with pid: 45457
^C
[2022-12-07 17:05:36 +0530] [45452] [INFO] Handling signal: int
[2022-12-07 17:05:36 +0530] [45457] [INFO] Worker exiting (pid: 45457)
Mastering Multiple Datasets with Feature Discovery https://www.datarobot.com/ai-accelerators/mastering-multiple-datasets-with-feature-discovery/ Mon, 26 Feb 2024 17:46:23 +0000 https://www.datarobot.com/?post_type=aiaccelerator&p=53733 This notebook outlines a repeatable framework for end-to-end production machine learning. It includes time-aware feature engineering across multiple tables, training dataset creation, model development, and production deployment.

Problem Framing

It is common to build training data from multiple sources, but this process can be time consuming and error prone, especially when you need to create many time-aware features.

  • Event-based data is present in every vertical. For example, customer transactions in retail or banking, medical visits, or production line data in manufacturing.
  • Summarizing this information at the parent (entity) level is necessary for most classification and regression use cases. For example, if you are predicting fraud, churn, or propensity to purchase, you will likely want summary statistics of a customer's transactions over a historical window.

This raises many practical considerations as a data scientist: How far back in time is relevant for training? Within that training period, which windows are appropriate for features? 30 days? 15? 7? Further, which datasets and variables should you consider for feature engineering? Answering these conceptual questions requires domain expertise or interaction with business SMEs.

In practice, especially at the MVP stage, it is common to limit the feature space you explore to what’s been created previously or add a few new ideas from domain expertise.

  • Feature stores can be helpful to quickly try features which were useful in a previous use case, but it is a strong assumption that previously generated lagged features will adapt well across all future use cases.
  • There are almost always important interactions you haven’t evaluated or thought of.

Multiple tactical challenges arise as well. Some of the more common ones are:

  • Time formats are inconsistent between datasets (e.g., minutes vs. days), and need to be handled correctly to avoid target leakage.
  • Encoding text and categorical data aggregates over varying time horizons across tables is generally painful and prone to error.
  • Creating a hardened data pipeline for production can take weeks depending on the complexity.
  • A subtle wrinkle is that short- and long-term effects of data matter, particularly with customers/patients/humans, and those effects change over time. It's hard to know a priori which lagged features to create.
  • When data drifts and behavior changes, you very well may need entirely new features post-deployment, and the process starts all over.

All of these challenges inject risk into your MVP process. The best-case scenario is that historical features capture signal in your new use case, and further exploration of new datasets is limited once the model is "good enough". The worst-case scenario is that you determine the use case isn't worth pursuing because your features don't capture the new signal. You often end up in the middle, struggling to know how to improve a model you are sure can be better.

What if you could radically collapse the cycle time to explore and discover features across any relevant dataset?

This notebook provides a template to:

  1. Load data into Snowflake and register with DataRobot’s AI Catalog.
  2. Configure and build time aware features across multiple historical time-windows and datasets using Snowflake (applicable to any database).
  3. Build and evaluate multiple feature engineering approaches and algorithms for all data types.
  4. Extract insights and identify the best feature engineering and modeling pipeline.
  5. Test predictions locally.
  6. Deploy the best performing model and all feature engineering in a Docker container, and expose a REST API.
  7. Score from Snowflake and write predictions back to Snowflake.

For more information about the Python client, reference the documentation.

Setup

Import libraries

In [1]:

# requires Python 3.8 or higher
from IPython.display import display, HTML
import matplotlib.pylab as plt
import numpy as np
import pandas as pd

from utils import prepare_demo_tables_in_db

Connect to DataRobot

In [2]:

# Authenticate in to your DataRobot instance
import datarobot as dr  # Requires version 3.0 or later

dr.__version__


# use your DataRobot API Credentials
DATAROBOT_API_TOKEN = "YOURAPITOKEN"
DATAROBOT_ENDPOINT = "YOUR_DATAROBOT_BASE_URL"

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix="AIA-AE-AFD-9",  # Optional but helps DataRobot improve this workflow
)

dr.client._global_client = client
Out [2]:

'3.0.2'

Predict defaults from customer transactions, profiles, and default data

The data used in this notebook is from anonymized historical loan application data, and used to predict if a customer will default on their loan or not. There are three tables:

  • LC_Train – Contains a customer ID, the date, and whether the loan defaulted or not (the target variable). This is the primary dataset.
  • LC_Profile – Contains loan level data (interest rate, purpose for the loan, etc.), and customer info (address, employment, etc.). This is a secondary dataset.
  • LC_Transactions – Has multiple transactions per customer across accounts, transaction type, and time. This is a secondary dataset.
  • LC_Test – The validation-set version of LC_Train. This is used to show the deployment pipeline on new data, and it is the primary dataset for an example deployment.

You want to create a training dataset of one record per customer, with relevant time-based features from their transactions as well as non-time based features from their profile. The data is in a public S3 bucket and will be transferred to your Snowflake instance.
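
For intuition, the sketch below (hypothetical pandas code, not part of this workflow) builds just one such time-aware feature by hand: the average transaction amount per customer over the 30 days before the prediction point. Feature Discovery automates this kind of aggregation across many windows, statistics, and data types.

import pandas as pd

base = "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/"
train = pd.read_csv(base + "LC_train.csv", parse_dates=["date"])
trans = pd.read_csv(base + "LC_transactions.csv", parse_dates=["Date"])

# Join each customer's transactions to their prediction point and keep a 30-day lookback window
joined = train[["CustomerID", "date"]].merge(trans, on="CustomerID", how="left")
window = joined[
    (joined["Date"] < joined["date"])
    & (joined["Date"] >= joined["date"] - pd.Timedelta(days=30))
]

# One hand-built feature: 30-day average transaction Amount per CustomerID
feature = (
    window.groupby("CustomerID")["Amount"].mean().rename("amount_30d_avg").reset_index()
)
train = train.merge(feature, on="CustomerID", how="left")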

Create Governed Datasets for Training

Import from S3 to Snowflake

Update the fields below with your Snowflake information to read in each file from S3 and create new tables in your Snowflake instance.

In [3]:

# Fill out the credentials for your instance. You will need write access to a database.
db_user = "your_username"  # Username to access snowflake database
db_password = "your_password"  # Password
account = "eg:datarobotap_partner.ap-southeast-2"  # Snowflake account identifier, can be found in the db_url
db = "YOUR_DB_NAME"  # Database to Write_To, "Snowflake Demo DB" in our example below
warehouse = "YOUR_WAREHOUSE"  # Warehouse
schema = "YOUR_SCHEMA"

db_url = "jdbc:snowflake://{account}.snowflakecomputing.com/?warehouse={warehouse}&db={db}".format(
    account=account, db=db, warehouse=warehouse
)
In [4]:

# The cell below writes the tables to your instance
response = prepare_demo_tables_in_db(
    db_user=db_user,
    db_password=db_password,
    account=account,
    db=db,
    warehouse=warehouse,
    schema=schema,
)

******************************

table: LC_PROFILE

First five records (transposed):

CustomerID:           C900002437, C900006073, C900007834, C900001691, C900002594
loan_amnt:            4000, 16000, 8700, 18000, 16000
funded_amnt:          4000, 16000, 8700, 18000, 16000
term:                 60 months, 60 months, 36 months, 60 months, 36 months
int_rate:             7.29%, 18.25%, 7.88%, 11.49%, 11.83%
installment:          79.76, 408.48, 272.15, 395.78, 530.15
grade:                A, F, A, B, B
sub_grade:            A4, F1, A5, B4, B3
emp_title:            Time Warner Cable, Ottawa University, Kennedy Wilson, TOWN OF PLATTEKILL, Belmont Correctional
emp_length:           10+ years, < 1 year, 4 years, 10+ years, 10+ years
home_ownership:       MORTGAGE, RENT, RENT, MORTGAGE, MORTGAGE
annual_inc:           50000, 39216, 65000, 57500, 50004
verification_status:  not verified, not verified, not verified, not verified, VERIFIED – income
purpose:              medical, debt_consolidation, credit_card, debt_consolidation, debt_consolidation
zip_code:             766xx, 660xx, 916xx, 124xx, 439xx
addr_state:           TX, KS, CA, NY, OH
info for  LC_PROFILE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CustomerID           10000 non-null  object 
 1   loan_amnt            10000 non-null  int64  
 2   funded_amnt          10000 non-null  int64  
 3   term                 10000 non-null  object 
 4   int_rate             10000 non-null  object 
 5   installment          10000 non-null  float64
 6   grade                10000 non-null  object 
 7   sub_grade            10000 non-null  object 
 8   emp_title            9408 non-null   object 
 9   emp_length           9741 non-null   object 
 10  home_ownership       10000 non-null  object 
 11  annual_inc           9999 non-null   float64
 12  verification_status  10000 non-null  object 
 13  purpose              10000 non-null  object 
 14  zip_code             10000 non-null  object 
 15  addr_state           10000 non-null  object 
dtypes: float64(2), int64(2), object(12)
memory usage: 1.2+ MB
None
writing LC_PROFILE to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/LC_profile.csv
******************************
table: LC_TRANSACTIONS

   CustomerID   AccountID    Date        Amount  Description
0  C900000001   A484993284   2016-07-21  79.42   alcohol
1  C900000001   A651639077   2016-07-31  37.87   government charges
2  C900000002   A355056969   2016-06-29  4.92    amortisation
3  C900000002   A355056969   2016-07-01  18.97   interest on purchases
4  C900000002   A355056969   2016-07-02  29.06   charity
info for  LC_TRANSACTIONS
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412459 entries, 0 to 412458
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   CustomerID   412459 non-null  object 
 1   AccountID    412459 non-null  object 
 2   Date         411129 non-null  object 
 3   Amount       403750 non-null  float64
 4   Description  412441 non-null  object 
dtypes: float64(1), object(4)
memory usage: 15.7+ MB
None
writing LC_TRANSACTIONS to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/LC_transactions.csv
******************************
table: LC_TRAIN

   CustomerID   BadLoan  date
0  C900000001   No       2016-08-06
1  C900000002   No       2016-07-27
2  C900000003   No       2016-07-06
3  C900000004   No       2016-08-26
4  C900000005   No       2016-08-28
info for  LC_TRAIN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9499 entries, 0 to 9498
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   CustomerID  9499 non-null   object
 1   BadLoan     9499 non-null   object
 2   date        9499 non-null   object
dtypes: object(3)
memory usage: 222.8+ KB
None
writing LC_TRAIN to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/LC_train.csv
******************************
table: LC_TEST
   CustomerID   BadLoan  date
0  C900009501   No       2016-08-03
1  C900009502   No       2016-07-23
2  C900009503   Yes      2016-08-29
3  C900009504   Yes      2016-07-01
4  C900009505   No       2016-08-13
info for  LC_TEST
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   CustomerID  500 non-null    object
 1   BadLoan     500 non-null    object
 2   date        500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB
None
writing LC_TEST to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/LC_test.csv

Create a data store in AI Catalog

To register the data with DataRobot, you will need to authorize DataRobot to access Snowflake. To do so, create credentials authorizing AI Catalog access to this new database, and pass the JDBC driver information for Snowflake to create a Datastore in the AI Catalog. Once the Datastore (the Snowflake database) is registered, you can access the tables you need, and version history metadata can be associated with any downstream project, model, deployment, or prediction going forward.

For more information, reference the DataRobot documentation for integrating with a Snowflake database.

In [5]:

# Find the JDBC driver ID from name
# Can be skipped if you have the ID--code is shown here for completeness
for d in dr.DataDriver.list():
    if d.canonical_name == "Snowflake (3.13.9 - recommended)":
        print((d.id, d.canonical_name))
('626bae0a98b54f9ba70b4122', 'Snowflake (3.13.9 - recommended)')
In [6]:

# Create a data store
# This step can be skipped if you have the data store ID--code is shown here for completeness
data_store = dr.DataStore.create(
    data_store_type="jdbc",
    canonical_name="Snowflake Demo DB",
    driver_id="626bae0a98b54f9ba70b4122",
    jdbc_url=db_url,
)
data_store.test(username=db_user, password=db_password)
Out [6]:

{'message': 'Connection successful'}

Create and store credentials to allow the AI Catalog access to this database. These can be found in the Data Connections tab under your profile in DataRobot.

In [7]:
cred = dr.Credential.create_basic(
    name="test_cred",
    user=db_user,
    password=db_password,
)

Create versioned datasets from your data store in Snowflake

After registering the Snowflake DB in AI Catalog, you can use the DataRobot API to access Snowflake tables.

To facilitate data access, DataRobot defines the following entities:

  • Data store: The system with the data, in this case Snowflake. You created this in the previous step.
  • Data source: The query or table with the data. In this case, we have three: LC_PROFILE, LC_TRAIN, and LC_TRANSACTIONS. We'll set up LC_TEST later.
  • Dataset: A registered dataset for ML projects, which is a query or pull from the parent data source.

This structure allows us to track versions of data used for modeling and predictions. For each table, you will:

  • Create a new data source from a query of the table in the Snowflake DB.
  • Create a dataset, which is a versioned extract from that table (data source). You will use a dynamic extract policy here, but you can set up various snap-shotting policies as well.

For completeness, we explicitly create each dataset with individual code blocks. In practice, you can easily create a helper function for this (see the sketch below). If you already have dataset IDs in AI Catalog, you can proceed to project creation.
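
A minimal helper sketch (the function name is hypothetical; it wraps exactly the same datarobot calls used in the individual cells below and reuses the dr client imported above):

def register_snowflake_table(data_store, cred, db, schema, table, name):
    """Create a JDBC data source for a Snowflake table and a dynamic dataset from it."""
    params = dr.DataSourceParameters(
        data_store_id=data_store.id,
        query="SELECT * FROM {db}.{schema}.{table}".format(db=db, schema=schema, table=table),
    )
    data_source = dr.DataSource.create(
        data_source_type="jdbc", canonical_name=name, params=params
    )
    return dr.Dataset.create_from_data_source(
        data_source.id,
        do_snapshot=False,  # dynamic dataset, pulled fresh from Snowflake at use time
        credential_id=cred.credential_id,
    )

# Example usage:
# dataset_target = register_snowflake_table(data_store, cred, db, schema, "LC_TRAIN", "snowflake_lc_train")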

In [8]:

# Create the target dataset
# Define the query
params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query="SELECT * FROM {db}.{schema}.LC_TRAIN".format(db=db, schema=schema),
)
# Establish the data source
data_source = dr.DataSource.create(
    data_source_type="jdbc", canonical_name="snowflake_lc_train", params=params
)
# Create an individual dataset
# New datasets can be made from the same data source in the future, with completely new unique identifiers for other projects
dataset_target = dr.Dataset.create_from_data_source(
    data_source.id,
    do_snapshot=False,  # Create a dynamic dataset from the datastore
    credential_id=cred.credential_id,
)

# One way to take a look at the data
# dataset_target.get_as_dataframe()

Use the snippet below to create the profile dataset.

In [9]:

params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query="SELECT * FROM {db}.{schema}.LC_PROFILE".format(db=db, schema=schema),
)
data_source = dr.DataSource.create(
    data_source_type="jdbc", canonical_name="snowflake_lc_profile", params=params
)
dataset_profile = dr.Dataset.create_from_data_source(
    data_source.id, do_snapshot=False, credential_id=cred.credential_id
)
# One way to quickly jump to the DataRobot GUI
# dataset_profile.open_in_browser()

Use the snippet below to create the transactions dataset.

In [10]:

params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query="SELECT * FROM {db}.{schema}.LC_TRANSACTIONS".format(db=db, schema=schema),
)
data_source = dr.DataSource.create(
    data_source_type="jdbc", canonical_name="snowflake_lc_transactions", params=params
)
dataset_trans = dr.Dataset.create_from_data_source(
    data_source.id, do_snapshot=False, credential_id=cred.credential_id
)

HTML(
    f"""<div style="text-aligh:center;padding:.75rem;"> 
    <a href="{dataset_trans.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Dataset in DataRobot</a>
</div>"""
)

Out [10]:

Open App in DataRobot: https://app.datarobot.com/ai-catalog/63bafe804ffb1b5b6cacd976

You can extract metadata and statistics on the data that has been registered in AI Catalog.

In [11]:

features_from_dr = dataset_trans.get_all_features()

pd.DataFrame(
    [
        {
            "Feature Name": f.name,
            "Feature Type": f.feature_type,
            "Unique Count": f.unique_count,
            "NA Count": f.na_count,
            "Min": f.min,
            "Mean": f.mean,
            "Median": f.median,
            "Max": f.max,
        }
        for f in features_from_dr
    ]
)

Out [11]:

   Feature Name  Feature Type  Unique Count  NA Count  Min / Mean / Median / Max
0  AccountID     Text          28205         0         None / None / None / None
1  Amount        Numeric       18057         8709      0.013877.763741899723
2  CustomerID    Categorical   9311          0         None / None / None / None
3  Date          Date          87            1330      2016-06-01 / 2016-07-14 / 2016-07-14 / 2016-08-26
4  Description   Categorical   153           18        None / None / None / None

Configure Time-aware Feature Engineering

To set up a Feature Discovery project from the API, you want to create a DataRobot Project object so you can define all of the relationships between datasets.

  • Create the project with the primary dataset (this will always have your target variable), provide database credentials, and a project name.
  • Configure which feature engineering operators should be explored during Feature Discovery across datasets.

Create a DataRobot project

In [12]:

project = dr.Project.create_from_dataset(
    dataset_target.id,  # unique ID of the target dataset we just created
    credential_id=cred.credential_id,  # don't forget the snowflake credentials
    project_name="Snowflake Lending Club API",
)

Store your Feature Discovery settings as a variable, which can be passed to the project's Advanced Options before you start building features and modeling. The time windows used to derive these features are defined in later cells. This is very useful if you want to constrain or experiment with certain feature derivation types. Importantly, this list uses only non-UDF features, which can be exported and executed via Spark SQL.

In [13]:

# Define the type of feature engineering you want to explore
feature_discovery_settings_no_udf = [
    {"name": "enable_days_from_prediction_point", "value": True},
    {"name": "enable_hour", "value": True},
    {"name": "enable_categorical_num_unique", "value": False},
    {"name": "enable_categorical_statistics", "value": False},
    {"name": "enable_numeric_minimum", "value": True},
    {"name": "enable_token_counts", "value": False},
    {"name": "enable_latest_value", "value": True},
    {"name": "enable_numeric_standard_deviation", "value": True},
    {"name": "enable_numeric_skewness", "value": False},
    {"name": "enable_day_of_week", "value": True},
    {"name": "enable_entropy", "value": False},
    {"name": "enable_numeric_median", "value": True},
    {"name": "enable_word_count", "value": False},
    {"name": "enable_pairwise_time_difference", "value": True},
    {"name": "enable_days_since_previous_event", "value": True},
    {"name": "enable_numeric_maximum", "value": True},
    {"name": "enable_numeric_kurtosis", "value": False},
    {"name": "enable_most_frequent", "value": False},
    {"name": "enable_day", "value": True},
    {"name": "enable_numeric_average", "value": True},
    {"name": "enable_summarized_counts", "value": False},
    {"name": "enable_missing_count", "value": True},
    {"name": "enable_record_count", "value": True},
    {"name": "enable_numeric_sum", "value": True},
]

Define the relationships between datasets

Define relationships between datasets for joins and the primary time field to ensure correct time-awareness for feature engineering. DataRobot will automatically handle variations in time granularity between datasets (e.g., minutes vs. hours). The following snippet defines how each dataset relates to the others, and how far back in time to derive features.

  • Specify the relationships (join-keys) between the secondary and primary datasets, and the Feature Derivation Window (FDW) to explore.
  • You can have multiple relationships between datasets (e.g., a join on customer ID and a separate join on Products to look at aggregate behavior at the product vs. customer level).
  • The ‘primary_temporal_key’ will be the date-time field used to respect the prediction point (defined at project start).
  • You can define multiple time windows to explore long and short-term effects.
  • DataRobot will automatically recognize data is coming from Snowflake, and push-down compute to Snowflake to accelerate feature engineering.
  • For further reading, reference the documentation for Feature Discovery and time series applications.
In [14]:

########################### Define the datasets for feature engineering ###############################
# Store the dataset AI Catalog IDs as a variable for simplicity
profile_catalog_id = dataset_profile.id
profile_catalog_version_id = dataset_profile.version_id

transac_catalog_id = dataset_trans.id
transac_catalog_version_id = dataset_trans.version_id

# Define the secondary datasets
# Defines the dataset identifier, temporal key, and snapshot policy for the secondary datasets
profile_dataset_definition = dr.DatasetDefinition(
    identifier="profile",
    catalog_id=profile_catalog_id,
    catalog_version_id=profile_catalog_version_id,
    snapshot_policy="dynamic"  # requires a jdbc source
    # feature_list_id='607cd4d362fc0cc7c8bc04cd', can be used to set a different feature_list
)

transaction_dataset_definition = dr.DatasetDefinition(
    identifier="transactions",
    catalog_id=transac_catalog_id,
    catalog_version_id=transac_catalog_version_id,
    primary_temporal_key="Date",
    snapshot_policy="dynamic",  # requires a jdbc source
)

########################## Define the join relationships and Feature Derivation Windows #########################

profile_transaction_relationship = dr.Relationship(
    dataset1_identifier="profile",  # join profile
    dataset2_identifier="transactions",  # to transactions
    dataset1_keys=["CustomerID"],  # on CustomerID
    dataset2_keys=["CustomerID"],
)
# You do not need to specify dataset1Identifier when joining with the primary dataset
primary_profile_relationship = dr.Relationship(  # join primary dataset
    dataset2_identifier="profile",  # to profile
    dataset1_keys=["CustomerID"],  # on CustomerID
    dataset2_keys=["CustomerID"],
    # feature_derivation_window_start=-14,
    # feature_derivation_window_end=-1,
    # feature_derivation_window_time_unit='DAY',
    feature_derivation_windows=[
        {"start": -7, "end": -1, "unit": "DAY"},
        {"start": -14, "end": -1, "unit": "DAY"},
        {"start": -30, "end": -1, "unit": "DAY"},
    ],  # example of multiple FDW
    prediction_point_rounding=1,
    prediction_point_rounding_time_unit="DAY",
)
# Store datasets and relationships as a list for config settings
dataset_definitions = [profile_dataset_definition, transaction_dataset_definition]
relationships = [primary_profile_relationship, profile_transaction_relationship]

# Create the relationships configuration to define the connection between the datasets
# This will be passed to the DataRobot project configuration for Autopilot
relationship_config = dr.RelationshipsConfiguration.create(
    dataset_definitions=dataset_definitions,
    relationships=relationships,
    feature_discovery_settings=feature_discovery_settings_no_udf,
)

The resulting configuration will appear in DataRobot as seen below. The following section of code will activate the feature engineering displayed in the top row.

[Image: Feature Discovery relationship configuration in DataRobot]

Build models

So far you have:

  • Registered your datasets in the AI Catalog and defined their relationships
  • Defined the types of feature engineering and time frames to explore

This configuration is saved and can be reused. To start modeling, you want to:

  • Pass the newly defined relationship config from above.
  • Since you are using dynamic datasets, pass your credentials, because each primary dataset extract triggers a query of the secondary datasets.
  • Note: the first line of the cell below toggles Supervised Feature Reduction (SFR). The Feature Discovery step creates and explores a wide range of features (often hundreds or thousands), and SFR intelligently weeds out low-information features. The remaining features form one dataset of candidate features in your project (which you will further reduce).

Once complete, you will initiate Autopilot with DataRobot to build models.

In [15]:

# SFR default is True, shown here as an example of how to set SFR using the API
advanced_options = dr.AdvancedOptions(
    feature_discovery_supervised_feature_reduction=True
)


# partitioning_spec = dr.DatetimePartitioningSpecification('date')
# You can use prediction point instead

project.analyze_and_model(
    target="BadLoan",
    relationships_configuration_id=relationship_config.id,
    credentials=[
        {
            # only needed for snowflake dynamic datasets
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_profile.version_id,
        },
        {
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_trans.version_id,
        },
    ],
    feature_engineering_prediction_point="date",  # The prediction point is defined here. This maps to the 'primary_temporal_key' in the Relationship Config above
    advanced_options=advanced_options,
)

project.set_worker_count(-1)  # Use all available workers

project.wait_for_autopilot()
In progress: 9, queued: 0 (waited: 0s)
In progress: 9, queued: 0 (waited: 1s)
In progress: 9, queued: 0 (waited: 2s)
In progress: 9, queued: 0 (waited: 3s)
In progress: 9, queued: 0 (waited: 5s)
In progress: 9, queued: 0 (waited: 7s)
In progress: 9, queued: 0 (waited: 11s)
In progress: 9, queued: 0 (waited: 19s)
In progress: 8, queued: 0 (waited: 32s)
In progress: 2, queued: 0 (waited: 53s)
In progress: 2, queued: 0 (waited: 74s)
In progress: 1, queued: 0 (waited: 95s)
In progress: 1, queued: 0 (waited: 116s)
In progress: 1, queued: 0 (waited: 137s)
In progress: 1, queued: 0 (waited: 158s)
In progress: 1, queued: 0 (waited: 179s)
In progress: 1, queued: 0 (waited: 199s)
In progress: 1, queued: 0 (waited: 220s)
In progress: 1, queued: 0 (waited: 241s)
In progress: 1, queued: 0 (waited: 262s)
In progress: 1, queued: 0 (waited: 283s)
In progress: 1, queued: 0 (waited: 304s)
In progress: 1, queued: 0 (waited: 325s)
In progress: 1, queued: 0 (waited: 345s)
In progress: 1, queued: 0 (waited: 366s)
In progress: 1, queued: 0 (waited: 387s)
In progress: 1, queued: 0 (waited: 408s)
In progress: 1, queued: 0 (waited: 429s)
In progress: 1, queued: 0 (waited: 449s)
In progress: 1, queued: 0 (waited: 470s)
In progress: 1, queued: 0 (waited: 491s)
In progress: 1, queued: 0 (waited: 512s)
In progress: 1, queued: 0 (waited: 533s)
In progress: 1, queued: 0 (waited: 554s)
In progress: 1, queued: 0 (waited: 575s)
In progress: 1, queued: 0 (waited: 596s)
In progress: 1, queued: 0 (waited: 617s)
In progress: 1, queued: 0 (waited: 637s)
In progress: 1, queued: 0 (waited: 658s)
In progress: 4, queued: 0 (waited: 679s)
In progress: 4, queued: 0 (waited: 700s)
In progress: 1, queued: 0 (waited: 721s)
In progress: 16, queued: 0 (waited: 742s)
In progress: 16, queued: 0 (waited: 762s)
In progress: 16, queued: 0 (waited: 784s)
In progress: 2, queued: 0 (waited: 804s)
In progress: 1, queued: 0 (waited: 825s)
In progress: 0, queued: 0 (waited: 846s)
In progress: 0, queued: 0 (waited: 867s)
In progress: 0, queued: 0 (waited: 888s)
In progress: 0, queued: 0 (waited: 909s)
In progress: 0, queued: 0 (waited: 930s)
In progress: 5, queued: 0 (waited: 951s)
In progress: 5, queued: 0 (waited: 972s)
In progress: 1, queued: 0 (waited: 992s)
In progress: 1, queued: 0 (waited: 1013s)
In progress: 1, queued: 0 (waited: 1034s)
In progress: 1, queued: 0 (waited: 1055s)
In progress: 1, queued: 0 (waited: 1076s)
In progress: 1, queued: 0 (waited: 1096s)
In progress: 2, queued: 0 (waited: 1118s)
In progress: 1, queued: 0 (waited: 1139s)
In progress: 0, queued: 0 (waited: 1159s)
In progress: 0, queued: 0 (waited: 1180s)
In progress: 0, queued: 0 (waited: 1201s)
In progress: 0, queued: 0 (waited: 1222s)

Model Selection

Identify the top performing model

Below is a baseline approach to extract insights from the model object, and build a deployment pipeline. You will:

  • Find the model that is recommended for deployment. This is based on the lowest error of your project metric for cross-validation.
  • Plot the top ten features used in the recommended model.
  • Observe how many of the features used have been generated by DataRobot automatically.

Note that any insight in the DataRobot app can be accessed via the API; model analysis is kept brief here to focus on the end-to-end pipeline in this tutorial.

Reference the DataRobot documentation for more information on feature aggregations.
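
As an aside, the same pattern extends to other insights. A minimal sketch (assuming a binary classification project; it repeats the recommended-model retrieval from the next cell):

# Sketch: pull additional model insights directly via the API
model = project.recommended_model()        # same retrieval as in the next cell
roc = model.get_roc_curve("validation")    # ROC curve points for the validation partition
lift = model.get_lift_chart("validation")  # binned predicted vs. actual values
print(len(roc.roc_points), "ROC points;", len(lift.bins), "lift chart bins")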

In [16]:

# The model used for predictions
model = project.recommended_model()

display(
    HTML(
        f"""<div style="text-aligh:center;padding:.75rem;"> 
    <a href="{model.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">{model.model_type}</a>
</div>"""
    )
)
print("The top performing model was", model.model_type)
print(
    "Feature list used:",
    model.featurelist_name,
    ", containing",
    len(model.get_features_used()),
    "features",
)
print("*" * 10)

pd.DataFrame(model.metrics)

Light Gradient Boosted Trees Classifier with Early Stopping
The top performing model was Light Gradient Boosted Trees Classifier with Early Stopping
Feature list used: DR Reduced Features M19 , containing 41 features

**********

Out[16]:

Metric                 validation   crossValidation   holdout
AUC                    0.803830     0.804238          0.832140
Area Under PR Curve    0.46617      0.45420           0.49662
FVE Binomial           0.212090     0.207666          0.243140
Gini Norm              0.607660     0.608476          0.664280
Kolmogorov-Smirnov     0.490740     0.495658          0.537300
LogLoss                0.284530     0.286126          0.273810
Max MCC                0.38021      0.37724           0.42922
RMSE                   0.286280     0.287538          0.281980
Rate@Top10%            0.453950     0.457896          0.510530
Rate@Top5%             0.631580     0.605264          0.705260
Rate@TopTenth%         1.0          0.9               0.5
(training, backtestingScores, and backtesting rows are all NaN)

Note that the top model was trained on the DR Reduced Features M19 feature list. This is the result of built-in feature reduction that removes features with correlated permutation importance for the top performing model from Autopilot. This was reduced from Model 19 (M19), which by default used the Informative Features feature list.

You also enabled Supervised Feature Reduction (SFR) in advanced settings, which occurs after target selection but prior to running any models. In terms of total features, you have so far gone from 24 features in your original three tables to:

  • 138 explored features (from the Feature Discovery tab)
  • 75 features that progressed past supervised feature reduction
  • 81 Informative Features (including 6 from the initial dataset)
  • 41 in DR Reduced Features M19

You can imagine how the feature space grows in size with additional data and the time-savings created by iterating quickly on datasets.

You can quickly compare the performance between the baseline (Informative Features) and reduced feature lists.

In [102]:

print(
    "Best performing blueprint LogLoss (5 fold CV at 64% of training data, Informative Features):",
    dr.Model.get(project.id, "63bb034a1f65cbf40f2e2031").metrics["LogLoss"][
        "crossValidation"
    ],
)
print(
    "Best performing blueprint LogLoss (5 fold CV at 64% of training data, DR Reduced Features):",
    dr.Model.get(project.id, "63bb0458d98255aeb86ca839").metrics["LogLoss"][
        "crossValidation"
    ],
)
Best performing blueprint LogLoss (5 fold CV at 64% of training data, Informative Features): 0.289138
Best performing blueprint LogLoss (5 fold CV at 64% of training data, DR Reduced Features): 0.288886

Evaluate features

In the following cell you will plot permutation importance of the top ten features.

In [19]:

# Plot permutation based importance (Feature Impact)
plt_data = (
    pd.DataFrame.from_records(model.get_or_request_feature_impact())
    .sort_values("impactNormalized", ascending=True)
    .tail(10)
)  # top 10, can remove to see all features
ax = plt.barh(plt_data["featureName"], plt_data["impactNormalized"])
plt.title(f"{project.project_name} - {model.model_type}")
plt.show()
[Plot: Feature Impact (normalized permutation importance) for the top 10 features]

Note how many of the top 10 features were derived from Feature Discovery, and how diverse they are. Features such as the standard deviation of the number of days between transactions at the CustomerID level over a 29-day window (shown below) capture the variation in monthly spending behavior; they provide signal and would also be prone to error if written by hand in SQL or Python.

Notice that the 29 (not 30) days comes from how you defined the Feature Derivation Window as 30 days up to 1 day before the prediction point in the primary_profile_relationship. As you move beyond your MVP model and want to experiment with more datasets and historical time horizons, it can often be beneficial to create multiple projects programmatically to evaluate Feature Derivation Window combinations, rather than trying every combination in one project. This step-wise approach to adding data can simplify feature selection and also helps in determining your ultimate feature set. There is no silver bullet, but our AI Accelerator with MLFlow can assist with evaluation.

[Image: feature lineage for the derived transactions feature]

It’s also possible to view the feature lineage via the API.

In [103]:

# Function to take in a feature from a project and return its lineage


def get_lineage(proj, fname):
    lineage = dr.Feature.get(proj.id, fname)
    feat_eng = dr.models.FeatureLineage.get(
        proj.id, lineage.feature_lineage_id
    ).steps

    return pd.DataFrame.from_dict(feat_eng, orient="columns")


get_lineage(project, "profile[purpose]")
# e.g., get_lineage(project, "transactions (days since previous event by CustomerID) (29 days std)")

Out[103]:

step 0: step_type=generatedColumn, parents=[], name=profile[purpose], data_type=Categorical
step 1: step_type=join, parents=[0], is_time_aware=False, join_info={'join_type': 'left', 'left_table': {'datasteps': [3], 'columns': ['CustomerID']}, 'right_table': {'datasteps': [2], 'columns': ['CustomerID']}}
step 2: step_type=data, parents=[1], name=profile, catalog_id=63bafe3a59d6375c9ee0bced, catalog_version_id=63bafe3a59d6375c9ee0bcee, columns=[{'name': 'purpose', 'data_type': 'Categorical', 'is_input': True, 'is_cutoff': False}, {'name': 'CustomerID', 'data_type': 'Categorical', 'is_input': False, 'is_cutoff': False}]
step 3: step_type=data, parents=[1], name=Primary dataset, columns=[{'name': 'CustomerID', 'data_type': 'Categorical', 'is_input': False, 'is_cutoff': False}]

Finally, let's examine the full diversity of features used in the model we will deploy. The code below presents all features from the DR Reduced Features M19 feature list, sorted by their normalized permutation importance.

In [21]:

import pandas as pd

pd.set_option("display.max_colwidth", None)
plt_data = pd.DataFrame.from_records(model.get_or_request_feature_impact()).sort_values(
    "impactNormalized", ascending=False
)
plt_data

Out[21]:

    featureName                                                               impactNormalized  impactUnnormalized  redundantWith
0   profile[int_rate]                                                         1.000000          0.025971            None
1   profile[annual_inc]                                                       0.955103          0.024805            None
2   profile[term]                                                             0.638811          0.016591            None
3   transactions (days since previous event by CustomerID) (29 days median)   0.396326          0.010293            None
4   transactions (days since previous event by CustomerID) (29 days std)      0.351735          0.009135            None
5   date (days from transactions[Date]) (13 days avg)                         0.342196          0.008887            None
6   profile[purpose]                                                          0.293125          0.007613            None
7   transactions[Date] (Day of Week) (29 days unique count)                   0.279081          0.007248            None
8   transactions[Amount] (13 days max)                                        0.269887          0.007009            None
9   transactions (days since previous event by CustomerID) (6 days avg)       0.249275          0.006474            None
10  transactions[Amount] (29 days min)                                        0.229061          0.005949            None
11  transactions[Amount] (29 days median)                                     0.216190          0.005615            None
12  profile[sub_grade]                                                        0.205686          0.005342            None
13  date (days from transactions[Date]) (29 days sum)                         0.195962          0.005089            None
14  transactions[Amount] (6 days median)                                      0.194552          0.005053            None
15  transactions[Amount] (13 days latest)                                     0.186255          0.004837            None
16  transactions (days since previous event by CustomerID) (13 days std)      0.178152          0.004627            None
17  date (days from transactions[Date]) (6 days sum)                          0.176553          0.004585            None
18  transactions (29 days count)                                              0.173441          0.004505            None
19  date (days from transactions[Date]) (13 days std)                         0.165364          0.004295            None
20  profile[zip_code]                                                         0.162185          0.004212            None
21  date                                                                      0.151421          0.003933            None
22  transactions[Amount] (29 days avg)                                        0.150614          0.003912            None
23  transactions[Amount] (29 days missing count)                              0.149789          0.003890            None
24  profile[installment]                                                      0.142818          0.003709            None
25  date (days from transactions[Date]) (29 days avg)                         0.139815          0.003631            None
26  date (days from transactions[Date]) (29 days median)                      0.139349          0.003619            None
27  profile[grade]                                                            0.137971          0.003583            None
28  transactions[Date] (Day of Month) (29 days latest)                        0.135790          0.003527            None
29  transactions[Amount] (29 days latest)                                     0.133854          0.003476            None
30  profile[funded_amnt]                                                      0.112657          0.002926            None
31  transactions[Description] (29 days latest)                                0.107200          0.002784            None
32  profile[loan_amnt]                                                        0.098525          0.002559            None
33  transactions[Amount] (6 days latest)                                      0.094102          0.002444            None
34  transactions (days since previous event by CustomerID) (29 days min)      0.090249          0.002344            None
35  profile[emp_title]                                                        0.088079          0.002288            None
36  transactions[Description] (13 days latest)                                0.075553          0.001962            None
37  transactions[Amount] (6 days min)                                         0.073670          0.001913            None
38  transactions[Amount] (6 days sum)                                         0.073098          0.001898            None
39  profile[emp_length]                                                       0.044062          0.001144            None

As we mentioned in the Problem Framing, you generally may not pursue more complex features beyond what is available from past projects or a feature store, or want to deal with data types like text right away. However, in this case the text feature Purpose in the LC_Profile table was in the top 10 for adding signal. Further, we disabled numerous feature engineering steps in our Feature Discovery settings. The takeaway is that by leveraging automation we can progress the MVP more efficiently and explore relevant features, saving important cycle time in the initial phase of modeling.

The cell below shows how you can examine a blueprint step in your notebook. In this case, the text processing pipeline of DataRobot is highly efficient and also non-trivial.

In [22]:

print(
    "Feature engineering processes: ", model.blueprint.processes
)  # quick view of model level feature engineering processes. These can be explored in great detail
print("*" * 10)
print("Example explanation of text modeling step: ", model.blueprint.processes[3])
print("*" * 10)
print(model.get_model_blueprint_documents()[3].description)
Feature engineering processes:  ['Ordinal encoding of categorical variables', 'Missing Values Imputed', 'Converter for Text Mining', 'Auto-Tuned Word N-Gram Text Modeler using token occurrences', 'Light Gradient Boosted Trees Classifier with Early Stopping']
**********
Example explanation of text modeling step:  Auto-Tuned Word N-Gram Text Modeler using token occurrences
**********
Fit a single-word n-gram model to each text feature in the input dataset, then use the predictions from these models as inputs to an ElasticNet classifier.
Word N-Gram:
A word n-gram model is a special case of n-gram modeling, where the n-gram is a contiguous sequence of n items from a given sequence of text (an “item” is a single word). Word n-gram models first construct a “vocabulary” of all sequences of words in the input dataset and then make a matrix with counts of n-gram occurrences in each row.
Auto-Tuned N-Gram:
Auto-tuned n-grams automatically decide the size of n-gram to use, trying unigrams (or a bag-of-words model), bigrams, and trigrams, and then selecting the method with the best score on an out-of sample dataset. Unigrams, in this case, are individual words in the corpus. Bigrams will be any pair of words, such as “thank you” or “new york”. Trigrams are trios of sequential words, for example, “how are you”, “old brown dog”, “long, long ago”.
Further Text Processing:
This task can also use the TF-IDF transformation, which weights a word or sequence more highly if it appears more frequently within a document. It downweights words that appear more frequently between documents. This transformation performs well in a wide variety of downstream problems.
The task combines a text-vectorizer with an ElasticNet Regression estimator to enable gridsearches that select the best text-vectorization parameters. It includes the advanced settings option of running Singular Value Decomposition (SVD) on the vectorized text features.
If Japanese text is detected, the task uses MeCab to tokenize the text prior to finding word n-grams.

Configure Your Deployment Pipeline

Once you decide on a model and set of features, DataRobot will automatically handle time-aware derivation in production with your prediction data. To ensure your training and serving pipelines are correct, you need to define how to add in new data at prediction time. This involves:

  • Defining a secondary dataset configuration (the relationships between datasets)
  • Creating a dataset for the test data in the AI Catalog.
In [23]:

# The following steps allow you to use new secondary datasets during prediction
# Get the default config as reference, in an existing project with multiple configurations you should iterate over SecondaryDatasetConfigurations and use "is_default"
# The cell below outputs the format of the training configuration

default_dataset_config = dr.SecondaryDatasetConfigurations.list(project_id=project.id)[
    0
]
config = default_dataset_config.to_dict().copy()
config
Out [23]:

{'id': '63bb0026a1d008b60ebafa84',
 'project_id': '63bafedb82f0196b30e19589',
 'config': [{'feature_engineering_graph_id': '63baff65a1d008b60ebafa31',
   'secondary_datasets': [{'identifier': 'transactions',
     'catalog_id': '63bafe804ffb1b5b6cacd976',
     'catalog_version_id': '63bafe804ffb1b5b6cacd977',
     'snapshot_policy': 'dynamic'},
    {'identifier': 'profile',
     'catalog_id': '63bafe3a59d6375c9ee0bced',
     'catalog_version_id': '63bafe3a59d6375c9ee0bcee',
     'snapshot_policy': 'dynamic'}]}],
 'name': 'Default Configuration',
 'secondary_datasets': [{'identifier': 'transactions',
   'catalog_id': '63bafe804ffb1b5b6cacd976',
   'catalog_version_id': '63bafe804ffb1b5b6cacd977',
   'snapshot_policy': 'dynamic'},
  {'identifier': 'profile',
   'catalog_id': '63bafe3a59d6375c9ee0bced',
   'catalog_version_id': '63bafe3a59d6375c9ee0bcee',
   'snapshot_policy': 'dynamic'}],
 'creator_full_name': 'Chandler McCann',
 'creator_user_id': '595f915efc834a875328e54f',
 'created': datetime.datetime(2023, 1, 8, 17, 40, 54, tzinfo=tzutc()),
 'featurelist_id': None,
 'credential_ids': [{'credential_id': '63bafe017e36980b9ce0be7f',
   'catalog_version_id': '63bafe804ffb1b5b6cacd977'},
  {'credential_id': '63bafe017e36980b9ce0be7f',
   'catalog_version_id': '63bafe3a59d6375c9ee0bcee'}],
 'is_default': True,
 'project_version': None}

Define datasets and relationships for scoring

Your deployment pipeline could either A) point to the same dataset used in training, but grab the newest records, or B) point to a different dataset for scoring that uses the same schema. The relationship config of the original transactions table is used below for demonstration. In a production setting, this would pull the correct time-aware data as new transactions were added.
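If you go with option B, a minimal sketch might look like the following. It reuses the JDBC data store and credentials created earlier; the table name TRANSACTIONS_SCORING is a hypothetical placeholder for your own scoring table, and the exact arguments should be adapted to your environment.

# Hypothetical option B: register a different Snowflake table (same schema as training)
# as a dynamic dataset and use it as the "transactions" secondary dataset at prediction time
scoring_params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query="SELECT * FROM {db}.{schema}.TRANSACTIONS_SCORING".format(db=db, schema=schema),
)
scoring_source = dr.DataSource.create(
    data_source_type="jdbc", canonical_name="snowflake_transactions_scoring", params=scoring_params
)
scoring_dataset = dr.Dataset.create_from_data_source(
    scoring_source.id, do_snapshot=False, credential_id=cred.credential_id  # dynamic dataset
)

transactions_scoring_dataset = dr.models.secondary_dataset.SecondaryDataset(
    catalog_id=scoring_dataset.id,
    catalog_version_id=scoring_dataset.version_id,
    identifier="transactions",  # must match the identifier used during training
    snapshot_policy="dynamic",  # fetch the latest records at prediction time
)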

In [24]:

# transactions_secondary_dataset = default_dataset_config.secondary_datasets[1]
# transactions_secondary_dataset.to_dict()
In [25]:

# Create a "new" secondary dataset - this will reuse the transactions dataset here as an example
# You would update with your scoring data, and ensure the schema is the same as training

transactions_predict_dataset = dr.models.secondary_dataset.SecondaryDataset(
    catalog_id=dataset_trans.id,
    catalog_version_id=dataset_trans.version_id,  # Complete version lineage for predictions
    identifier="transactions",
    snapshot_policy="dynamic",  # Fetch the latest database records
)
transactions_predict_dataset.to_dict()
Out [25]:

{'identifier': 'transactions',
 'catalog_id': '63bafe804ffb1b5b6cacd976',
 'catalog_version_id': '63bafe804ffb1b5b6cacd977',
 'snapshot_policy': 'dynamic'}
In [26]:

# Create a prediction config with the original profile dataset (static) , and a "new" transactions prediction dataset

predict_config = dr.SecondaryDatasetConfigurations.create(
    project_id=project.id,
    name="predict config",
    featurelist_id=model.featurelist_id,
    secondary_datasets=[
        profile_dataset_definition.to_dict(),
        transactions_predict_dataset.to_dict(),
    ],
)

Create a test dataset

Reminder: this is simply a held-out slice of the LC_Train table for example purposes. When reusing this code, point the query to your scoring data and ensure the table configuration matches training.

In [27]:

params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query="SELECT * FROM {db}.{schema}.LC_TEST".format(db=db, schema=schema),
)
data_source = dr.DataSource.create(
    data_source_type="jdbc", canonical_name="snowflake_lc_test", params=params
)
dataset_test = dr.Dataset.create_from_data_source(
    data_source.id, do_snapshot=True, credential_id=cred.credential_id
)

Make predictions for validation locally

Before deploying a model into MLOps, you can request predictions against the model itself to make sure everything is set up properly. This method scores your data against the model object, without creating a full deployment.

In [28]:

# Prepare to get predictions for the test dataset. It must use the predict config that uses dynamic transactions and credentials for JDBC
# Uncomment the line below and use it instead if you decide to predict on a local test file
# dataset = project.upload_dataset(sourcedata="./data_to_predict.csv")
dataset = project.upload_dataset_from_catalog(
    dataset_id=dataset_test.id,
    credentials=[
        {
            # Only needed for snowflake dynamic datasets
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_profile.version_id,
        },
        {
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_trans.version_id,
        },
    ],
    secondary_datasets_config_id=predict_config.id,
)

pred_job = model.request_predictions(dataset.id)
preds = pred_job.get_result_when_complete()
In [29]:

# To inspect
preds.head()

Out[29]:

row_id | prediction | positive_probability | prediction_threshold | class_No | class_Yes
0 | No | 0.201868 | 0.5 | 0.798132 | 0.201868
1 | No | 0.130164 | 0.5 | 0.869836 | 0.130164
2 | No | 0.444530 | 0.5 | 0.555470 | 0.444530
3 | No | 0.030435 | 0.5 | 0.969565 | 0.030435
4 | No | 0.099553 | 0.5 | 0.900447 | 0.099553

Export Feature Discovery

You can download the training dataset with the discovered features. You can also download the Spark SQL code recipe. Note that in the feature_discovery_settings_no_udf call, you disabled custom User Defined Functions for certain advanced feature engineering techniques, such as entropy of categoricals over time and all summarized categorical data types.

In [33]:

# Download the training dataset with the discovered features
project.download_feature_discovery_dataset(file_name="./FD_download.csv")
In [34]:

# You can download the Spark SQL code recipe to generate the discovered features
project.download_feature_discovery_recipe_sqls(file_name="./recipe.sql")

Create a Deployment

Below you are going to identify a prediction server (to which you can make API calls), as well as deploy your model.

Creating a Deployment will generate a complete pipeline from your three Snowflake tables, including all of the time-aware feature creation. It will also expose a production-grade REST API for scoring against your deployed model, automating all of the feature engineering within the blueprint.

You need to:

  • Select a prediction server.
  • Deploy the model into the Model Registry and MLOps.
  • Update the deployment with the new secondary dataset config to ensure your scoring data pipelines are consistent.
In [30]:

# You can view the available prediction servers.
dr.PredictionServer.list()
Out [30]:

[PredictionServer(https://mlops.dynamic.orm.datarobot.com),
 PredictionServer(https://datarobot-cfds.dynamic.orm.datarobot.com),
 PredictionServer(https://cfds-ccm-prod.orm.datarobot.com)]
In [31]:

prediction_server = dr.PredictionServer.list()[0]

# Create the deployment
deployment = dr.Deployment.create_from_learning_model(
    model_id=model.id,
    description="A new SAFER deployment using the API",
    prediction_threshold=model.prediction_threshold,  # can configure and adjust
    label="Lending Club SAFER with dynamic transactions",
    default_prediction_server_id=prediction_server.id,
)
deployment
Out [31]:

Deployment(Lending Club SAFER with dynamic transactions)
In [32]:

# Update deployment to use predict config
deployment.update_secondary_dataset_config(
    predict_config.id,
    credential_ids=[
        {
            # Only needed for Snowflake dynamic datasets
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_profile.version_id,
        },
        {
            "credentialId": cred.credential_id,
            "catalogVersionId": dataset_trans.version_id,
        },
    ],
)

HTML(
    f"""<div style="text-aligh:center;padding:.75rem;"> 
    <a href="{deployment.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Deployment in DataRobot</a>
</div>"""
)

Out [32]:

Open Deployment in DataRobot

Score from Snowflake and make batch predictions

There are two ways of making batch predictions with the deployment. The first is to use the user OAuth JDBC connection you created in previous steps; the data is saved to DataRobot and can be accessed directly.

In this workflow you will save a local file and inspect it. Note that, depending on your use case, there are many ways to optimize latency as needed.
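If you would rather skip the intermediate AI Catalog dataset for the primary table, a hedged sketch of reading the scoring rows directly over JDBC could look like the following; treat the intake settings as illustrative (the query is a placeholder) and confirm the options against the batch prediction documentation for your installation.

# Hypothetical JDBC intake: read the primary scoring rows straight from Snowflake
batch_job_jdbc = dr.BatchPredictionJob.score(
    deployment.id,
    intake_settings={
        "type": "jdbc",
        "data_store_id": data_store.id,
        "credential_id": cred.credential_id,
        "query": "SELECT * FROM {db}.{schema}.LC_TEST".format(db=db, schema=schema),
    },
    output_settings={
        "type": "localFile",
        "path": "./predicted_jdbc.csv",
    },
)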

In [41]:
preds.columns
Out [41]:

Index(['row_id', 'prediction', 'positive_probability', 'prediction_threshold',
       'class_No', 'class_Yes'],
      dtype='object')
In [39]:

batch_job = dr.BatchPredictionJob.score(
    deployment.id,
    intake_settings={
        "type": "dataset",
        "dataset": dr.Dataset.get(dataset_test.id),
    },
    output_settings={
        "type": "localFile",
        "path": "./predicted.csv",
    },
)
In [ ]:

!! head ./predicted.csv
['BadLoan_Yes_PREDICTION,BadLoan_No_PREDICTION,BadLoan_PREDICTION,THRESHOLD,POSITIVE_CLASS,DEPLOYMENT_APPROVAL_STATUS',
 '0.2018676852,0.7981323148,No,0.5,Yes,APPROVED',
 '0.1301639109,0.8698360891,No,0.5,Yes,APPROVED',
 '0.4445297652,0.5554702348,No,0.5,Yes,APPROVED',
 '0.0304346035,0.9695653965,No,0.5,Yes,APPROVED',
 '0.0995533859,0.9004466141,No,0.5,Yes,APPROVED',
 '0.4984286285,0.5015713715,No,0.5,Yes,APPROVED',
 '0.0242088347,0.9757911653,No,0.5,Yes,APPROVED',
 '0.0198749244,0.9801250756,No,0.5,Yes,APPROVED',
 '0.2325775747,0.7674224253,No,0.5,Yes,APPROVED']
In [77]:

# compare to our manual predictions
df = pd.read_csv("predicted.csv")  # batch predictions
print(
    pd.DataFrame(
        data={"batch_preds": df.BadLoan_No_PREDICTION, "manual_preds": preds.class_No}
    ).head()
)
# evaluate predictions within 1e-10 tolerance
np.allclose(df.BadLoan_No_PREDICTION, preds.class_No, rtol=1e-10)
   batch_preds  manual_preds
0     0.798132      0.798132
1     0.869836      0.869836
2     0.555470      0.555470
3     0.969565      0.969565
4     0.900447      0.900447
Out [77]:
True

Alternatively, you can write directly back to Snowflake, which will log and govern your prediction history in AI Catalog for versioning, traceability, etc.

Note that in the previous steps you only pass the primary dataset, dataset_test, in the intake_settings call. The secondary datasets you just defined, dataset_trans and dataset_profile, are stored in the deployment configuration. This keeps your pipeline simple and minimizes net-new table creation from your source tables for each project. Uncomment the options below if you would like to test them and have the appropriate Snowflake permissions.

In [ ]:

#     Replace the intake and output settings, respectively, in the body of the `batch_job` call above if you would like to:

#     A) score with a local file but still reference the secondary datasets in Snowflake for feature generation
#     intake_settings={
#     'type': 'localFile',
#     'file': './data_to_predict.csv',
#     },

#     B) Write the predictions back to Snowflake
#     output_settings = {
#        'type':'jdbc',
#        'data_store_id': data_store.id,
#        'credential_id': cred.credential_id,
#        'statement_type': 'insert',
#        'table': 'PREDICTIONS',
#        'schema': 'PUBLIC',
#        'catalog': 'TEST_DB',
#        'create_table_if_not_exists': True
#     }
#       C) You can also add pass through columns and prediction explanations for each record by adding the below to the output settings
#         passthrough_columns_set = 'all', # Use if you want to include the input features in the output file (needed in database writeback if table doesn't exist)
#         max_explanations = 3, # Can go to 10 per record
#         download_timeout = 10*60,

Delete project artifacts

When you create authenticated data sources and downstream objects (like models or deployments) that depend on them, DataRobot has controls in place to prevent you from accidentally deleting a needed resource. As such, the order in which you clean up your project artifacts matters. The script below will loop through artifacts associated with your API token and remove them. This can be extended to other projects.

In [ ]:

def clean_notebook():
    # delete deployment
    for deployment in dr.Deployment.list():
        if deployment.label == "Lending Club SAFER with dynamic transactions":
            print("deleting " + deployment.label + " deployment id:" + deployment.id)
            deployment.delete()

    # delete project
    for p in dr.Project.list(
        search_params={"project_name": "Snowflake Lending Club API"}
    ):
        print("deleting " + p.project_name + " project id:" + p.id)
        p.delete()

    # delete datasets
    for d in dr.Dataset.list():
        if d.name in (
            "snowflake_lc_train",
            "snowflake_lc_test",
            "snowflake_lc_profile",
            "snowflake_lc_transactions",
        ):
            print("deleting " + d.name + " dataset id:" + d.id)
            dr.Dataset.delete(d.id)

    # delete datasources
    for d in dr.DataSource.list():
        if d.canonical_name in (
            "snowflake_lc_train",
            "snowflake_lc_test",
            "snowflake_lc_profile",
            "snowflake_lc_transactions",
        ):
            print("deleting " + d.canonical_name + " datasource id:" + d.id)
            d.delete()

    # delete datastore
    for d in dr.DataStore.list():
        if d.canonical_name == "Snowflake Demo DB":
            print("deleting " + d.canonical_name + " datastore id:" + d.id)
            d.delete()

    # delete credentials
    for c in dr.Credential.list():
        if c.name == "test_cred":
            print("deleting " + c.name + " credential id:" + c.credential_id)
            c.delete()
In [ ]:

# clean_notebook()
deleting Lending Club SAFER with dynamic transactions deployment id:63b628f43b074e6da16e58a9
deleting Snowflake Lending Club API project id:63b623915e42de1677f95a17
deleting snowflake_lc_test dataset id:63b62806b2add8d9cee0bfa0
deleting snowflake_lc_transactions dataset id:63b62341fba9e95255acda06
deleting snowflake_lc_profile dataset id:63b622f9dfbe01e35fe0c0da
deleting snowflake_lc_train dataset id:63b622b35d09e17805acdb79
deleting snowflake_lc_train datasource id:63b622b2c1f151aca73dc1f6
deleting snowflake_lc_profile datasource id:63b622f9c1f151aca73dc1f7
deleting snowflake_lc_transactions datasource id:63b623408fa80dd91b439852
deleting snowflake_lc_test datasource id:63b62806a48166415a1ff294
deleting Snowflake Demo DB datastore id:63b622adc1f151aca73dc1f5
deleting test_cred credential id:63b622b2139f9c75d745700d

Summary

The code above provides a template that can be expanded upon for nearly all common databases to create a complete modeling and production workflow with multiple datasets. One major benefit of this approach is that the entire data engineering and production ML pipeline is taken care of, and data quality checks can be monitored at both the front end and back end with MLOps.

Conceptually, you saw how discovering diverse features that capture patterns in the data can be accelerated via automation. The point here is about reducing cycle time to improve from a baseline, not automating away domain knowledge. Becoming familiar with techniques like this can increase the amount of time you spend working with stakeholders to inject domain expertise into your feature engineering and model evaluation workflows.

For further exploration, experiment with the Feature Discovery settings and enable all feature engineering types, which will produce even more diverse features (including the Summarized Categorical data type), or try experimenting with different time horizons programmatically and evaluating performance across projects. Consider these as building blocks for your data science projects.

Get Started with Free Trial

Experience new features and capabilities previously only available in our full AI Platform product.

The post Mastering Multiple Datasets with Feature Discovery appeared first on DataRobot AI Platform.

]]>
Multi-Model Analysis https://www.datarobot.com/ai-accelerators/multi-model-analysis/ Thu, 22 Feb 2024 16:52:05 +0000 https://www.datarobot.com/?post_type=aiaccelerator&p=53688 This accelerator shares several Python functions which can take the DataRobot insights - specifically model error, feature effects (partial dependence), and feature importance (Shap or permutation-based) and bring them together into one chart, allowing you to understand all of your models in one place and more easily share your findings with stakeholders.

The post Multi-Model Analysis appeared first on DataRobot AI Platform.

]]>
DataRobot provides many options for evaluating model accuracy. However, when you are working across multiple models or projects, the built-in model comparison may not suit your needs, especially if you need a clean way to compare three or more models and export that comparison as a .png or .jpg.

This notebook shares three Python functions which can pull out various accuracy metrics, feature impact, and feature effects from multiple models and plot them in one chart.

Outline

  1. Setup: import libraries and connect to DataRobot
  2. Accuracy Python function
  3. Feature Impact Python function
  4. Feature Effects Python function
  5. Example use and outputs

Setup

In [1]:

import datetime as dt
import sys

import datarobot as dr
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Everything below this comment only impacts the charts, not the models nor data.
# Customize as you see fit.
plt.style.use("tableau-colorblind10")
mpl.rcParams["figure.figsize"] = [11.0, 7.0]
mpl.rcParams["font.size"] = 18
mpl.rcParams["figure.titlesize"] = "large"
mpl.rcParams["font.family"] = "serif"
for param in [
    "xtick.bottom",
    "ytick.left",
    "axes.spines.top",
    "axes.spines.right",
    "legend.frameon",
    "legend.fancybox",
]:
    mpl.rcParams[param] = False
mpl.rcParams["figure.facecolor"] = "white"
# for plots with a dark background:
# for param in ['xtick.color', 'ytick.color', 'axes.labelcolor', 'text.color']:
#     mpl.rcParams[param] = 'e6ffff'

Accuracy Python function

For a full list of available accuracy metrics, please visit our documentation. Note that not every project will have every metric. For example, LogLoss is only available for classification problems, not regression problems. You can check the available metrics for your model by looking at dr.Model.metrics.
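For example, a quick way to list the metrics a given model exposes (the project ID below is a placeholder):

# Inspect which metrics are available before choosing what to plot
example_model = dr.Project.get("YOUR_PROJECT_ID").get_models()[0]  # hypothetical project ID
print(sorted(example_model.metrics.keys()))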

In [2]:

def plot_accuracy(
    model_dict,
    accuracy_metric_one,
    accuracy_metric_two,
    model_category_name="Model Categories",
    partition="crossValidation",
):
    """
    Collects the accuracy metrics across models provided and plots them on one plot.

    Parameters
    ----------
    model_dict : Dictionary of str keys and DataRobot Model/DatetimeModel object values
    accuracy_metric_one: str indicating the first accuracy metric of interest, such as LogLoss
    accuracy_metric_two: str indicating the second accuracy metric of interest, such as AUC
    model_category_name: str indicating the different categories each model represents
    partition: str indicating the data partition to use for the accuracy metric, such as holdout.
    """

    n_categories = len(model_dict)
    accuracy_scores = {}
    for cat, model in model_dict.items():
        accuracy_scores[cat] = {
            accuracy_metric_one: model.metrics[accuracy_metric_one][partition],
            accuracy_metric_two: model.metrics[accuracy_metric_two][partition],
        }
    accuracy = pd.DataFrame.from_dict(accuracy_scores, orient="index")
    accuracy = accuracy.reset_index().rename(columns={"index": "category"})
    accuracy.sort_values(by="category", inplace=True)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.text(0.5, 0.04, model_category_name, ha="center")
    sns.barplot(
        x="category",
        y=accuracy_metric_one,
        data=accuracy,
        hue=["blue"] * n_categories,
        ax=ax1,
    )
    ax1.set_title(accuracy_metric_one)
    sns.barplot(
        x="category",
        y=accuracy_metric_two,
        data=accuracy,
        hue=["blue"] * n_categories,
        ax=ax2,
    )
    ax2.set_title(accuracy_metric_two)
    for ax in (ax1, ax2):
        ax.legend_.remove()
        ax.set(xlabel=None, ylabel=None)
    return fig

Feature Impact Python function

For details on what feature impact is and how to interpret it, please visit our documentation.

In [3]:

def plot_feature_impact(
    model_dict,
    model_category_name="Model Categories",
    feature_map=None,
    impute_features=False,
):
    """
    Collects the feature impact across models provided and plots them on one plot.

    Parameters
    ----------
    model_dict: Dictionary of str keys and DataRobot Model/DatetimeModel object values.
        The keys should be whichever category the value represents.
    model_category_name: str indicating the different categories each model represents
    feature_map (optional): Dictionary of str keys and str values. The key should be the DataRobot feature name
        and the value should be what you want to appear on the plot.
    impute_features: boolean indicating whether features not present in feature_map should be imputed with "Other Features"
    """

    impact_jobs = {}
    for category, model in model_dict.items():
        project = dr.Project.get(model.project_id)
        if project.advanced_options.shap_only_mode == True:
            job = dr.ShapImpact.create(model.project_id, model.id)
        else:
            try:
                job = model.request_feature_impact()
            except dr.errors.JobAlreadyRequested:
                # if you manually queued feature impact outside of this function,
                # you may want to wait for that to finish before running this
                continue
        impact_jobs[category] = job

    feature_impact = []
    for category, model in model_dict.items():
        try:
            impact_jobs[category].wait_for_completion()
        except KeyError:
            pass
        project = dr.Project.get(model.project_id)
        if project.advanced_options.shap_only_mode == True:
            shap = dr.ShapImpact.get(model.project_id, model.id)
            impact = pd.DataFrame(shap.shap_impacts)
        else:
            impact = pd.DataFrame(model.get_feature_impact())
            impact.rename(
                columns={
                    "featureName": "feature_name",
                    "impactUnnormalized": "impact_unnormalized",
                },
                inplace=True,
            )
        try:
            impact["feature"] = impact["feature_name"].map(feature_map)
            if impute_features == True:
                impact["feature"] = impact["feature"].fillna("Other Features")
        except NameError:
            impact["feature"] = impact["feature_name"]
        agg_impact = impact.groupby("feature", as_index=False)[
            "impact_unnormalized"
        ].sum()
        agg_impact["impact_pct_of_total"] = (
            agg_impact["impact_unnormalized"] / agg_impact["impact_unnormalized"].sum()
        )
        agg_impact[model_category_name] = category
        feature_impact.append(agg_impact)
    feature_impact = pd.concat(feature_impact)

    feature_impact.sort_values(
        by=[model_category_name, "feature"], ascending=True, inplace=True
    )

    fig, ax = plt.subplots(1, 1)
    sns.histplot(
        feature_impact,
        x=model_category_name,
        weights="impact_pct_of_total",
        hue="feature",
        multiple="stack",
        ax=ax,
        hue_order=sorted(feature_impact["feature"].unique()),
    )
    ax.get_legend().set_bbox_to_anchor((1, 1))
    ax.legend_.set_title("Feature")
    ax.set(xlabel="Category", ylabel="Percent of Total Impact")
    ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(1.0))
    return fig

Feature Effects Python function

For more details on what feature effects is and how to interpret it, please visit our documentation.

In [4]:

def get_fe_data(model_dict, max_wait=600):
    """
    Collects the feature effects data across models and returns them in one Pandas DataFrame.

    Parameters
    ----------
    model_dict: Dictionary of str keys and DataRobot Model/DatetimeModel object values.
        The keys should be whichever category the value represents.
    """
    fe_jobs = {}
    for category, model in model_dict.items():
        project = dr.Project.get(model.project_id)
        if project.is_datetime_partitioned == True:
            job = model.request_feature_effect(backtest_index="0")
        else:
            job = model.request_feature_effect()
        fe_jobs[category] = job

    fe_list = []
    for category, model in model_dict.items():
        fe = fe_jobs[category].get_result_when_complete(max_wait=max_wait)
        feature_effects = fe.feature_effects
        for feature in feature_effects:
            feature_df = pd.DataFrame(feature["partial_dependence"]["data"])
            feature_df["feature"] = feature["feature_name"]
            feature_df["category"] = category
            feature_df["project_id"] = model.project_id
            feature_df["model_id"] = model.id
            fe_list.append(feature_df)
    fe_df = pd.concat(fe_list)
    return fe_df


def create_fe_plot(data, feature_name, title, xlabel, coltype):
    """
    Plots the feature effects for one feature from each model on one line.
    Numeric plots do not show null values.

    Parameters
    ----------
    data: Pandas DataFrame of the feature effects, from get_fe_data.
    feature_name: str of the feature. Must align to the feature name in the dataset.
    title: str for the title of the plot.
    xlabel: str for the x-axis label of the plot.
    coltype: str for the data type of the column. Must be one of: ['num', 'cat'].
    """
    df = data[data["feature"] == feature_name].copy()
    df.sort_values(by=["category"], inplace=True)
    fig = plt.figure(figsize=(16, 6))

    if coltype == "num":
        df["label"] = df["label"].astype(float)
        df.dropna(subset=["label"], inplace=True)
        ax = sns.lineplot(x="label", y="dependence", hue="category", data=df)
    elif coltype == "cat":
        df.sort_values(by=["category", "label"], inplace=True)
        ax = sns.barplot(x="label", y="dependence", hue="category", data=df)
    else:
        print("Unsupported column type.")
        return
    legend = ax.legend(ncol=2)
    ax.set(xlabel=xlabel, ylabel="Partial Dependence", title=title)
    return fig

Example use and outputs

When using these functions, it is important to use appropriate models. For accuracy, do not use models trained into your holdout; the models in your comparison should have out-of-sample partitions that can be used for these plots. For feature impact and feature effects, you may use the plots to analyze models during model selection or for the models you intend to use (or are already using) in production.

When using this for your own work, you only need to provide the model dictionary. The code below gives you an example from scratch, but you may skip it if you want to provide your own dict of models.
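If you already know which models you want to compare, a minimal sketch of assembling the dictionary yourself looks like this (the project and model IDs are placeholders):

# Hypothetical: build the model dictionary from known project/model IDs
model_dict = {
    "Champion": dr.Model.get("PROJECT_ID_A", "MODEL_ID_A"),
    "Challenger": dr.Model.get("PROJECT_ID_B", "MODEL_ID_B"),
}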

In [ ]:

# use this cell if you need example project(s)
# this same example is used across all 3 functions,
# so you only need to run this once! It may take a while
# skip this cell and go to the next one if you have already run this
data = pd.read_csv(
    "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv",
    encoding="iso-8859-1",
)

adv_opt = dr.AdvancedOptions(prepare_model_for_deployment=False)
project_dict = {}
for grade in data["grade"].unique():
    p = dr.Project.create(
        data[data["grade"] == grade],
        "Multi-Model Accuracy Example, Grade {}".format(grade),
    )
    p.analyze_and_model("is_bad", worker_count=-1, advanced_options=adv_opt)
    print("Project for Grade {} begun.".format(grade))
    project_dict[grade] = p

model_dict = {}
for grade, p in project_dict.items():
    p.wait_for_autopilot(verbosity=0)
    models = p.get_models()
    results = pd.DataFrame(
        [
            {
                "model_type": m.model_type,
                "blueprint_id": m.blueprint_id,
                "cv_logloss": m.metrics["LogLoss"]["crossValidation"],
                "model_id": m.id,
                "model": m,
            }
            for m in models
        ]
    )
    best_model = results["model"].iat[results["cv_logloss"].idxmin()]
    model_dict[grade] = best_model
    print("Project for Grade {} is finished.".format(grade))
In [5]:

# run this cell if you already ran the template example above
# it will be faster than re-running the cell above
projects = dr.Project.list(
    search_params={"project_name": "Multi-Model Accuracy Example, Grade"}
)
model_dict = {}
for p in projects:
    grade = p.project_name[-1]
    if p.is_datetime_partitioned == True:
        models = p.get_datetime_models()
    else:
        models = p.get_models()
    results = pd.DataFrame(
        [
            {
                "model_type": m.model_type,
                "blueprint_id": m.blueprint_id,
                "logloss": m.metrics["LogLoss"]["crossValidation"],
                "model_id": m.id,
                "model": m,
            }
            for m in models
        ]
    )
    best_model = results["model"].iat[results["logloss"].idxmin()]
    model_dict[grade] = best_model
In [6]: 

# This is what the input to these functions should look like
# A dictionary of keys which represent each of the categories you wish to plot,
# and values of the Model objects
# if your project(s) are datetime partitioned, the values will be DatetimeModel objects
model_dict
Out [6]: 

{'G': Model('eXtreme Gradient Boosted Trees Classifier'),
 'E': Model('eXtreme Gradient Boosted Trees Classifier'),
 'C': Model('Generalized Additive2 Model'),
 'D': Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 'B': Model('eXtreme Gradient Boosted Trees Classifier'),
 'F': Model('Light Gradient Boosting on ElasticNet Predictions '),
 'A': Model('eXtreme Gradient Boosted Trees Classifier')}

Accuracy

In [7]:

accuracy_plot = plot_accuracy(
    model_dict,
    accuracy_metric_one="LogLoss",
    accuracy_metric_two="AUC",
    model_category_name="Grade",
    partition="validation",
)
# if you wish to export and share the image:
# accuracy_plot.savefig('accuracy.png')
(Bar charts of LogLoss and AUC by Grade)

Feature Impact

In [8]:

# if you have a lot of features you want to bucket into an "Other" category,
# you can leave them out of the feature_map dictionary and set impute_feature to True
feature_map = {
    "annual_inc": "Annual Income",
    "desc": "Loan Description",
    "int_rate": "Interest Rate",
    "open_acc": "Open Accounts",
    "title": "Loan Title",
}
feature_impact_plot = plot_feature_impact(
    model_dict,
    model_category_name="Grade",
    feature_map=feature_map,
    impute_features=True,
)
# if you wish to export and share the image:
# feature_impact_plot.savefig('feature_impact.png')
(Stacked bar chart of percent of total feature impact by Grade)

Feature Effects

In [9]:

# We pull the data from Feature Effects first, as we can use the same dataset across each feature plot
# If you have a large dataset, you may need to adjust the max_wait parameter within the function
fe_data = get_fe_data(model_dict)
fe_data.head()

Out [9]:

label | dependence | feature | category | project_id | model_id
0 | 0.18968 | mths_since_last_delinq | G | 640f81a763568ea409b6b595 | 640f821a2d828cd1855d16bd
2 | 0.18968 | mths_since_last_delinq | G | 640f81a763568ea409b6b595 | 640f821a2d828cd1855d16bd
4 | 0.18968 | mths_since_last_delinq | G | 640f81a763568ea409b6b595 | 640f821a2d828cd1855d16bd
5 | 0.18968 | mths_since_last_delinq | G | 640f81a763568ea409b6b595 | 640f821a2d828cd1855d16bd
6 | 0.18968 | mths_since_last_delinq | G | 640f81a763568ea409b6b595 | 640f821a2d828cd1855d16bd
In [10]:

# With this plot, you can see varying effects of income and risk by each grade
# For grade G loans, your default risk actually increases as your income is higher
# This is likely because if you have high annual income and you are grade G, you probably already have a lot of debt or credit problems
annual_inc_plot = create_fe_plot(
    fe_data, "annual_inc", "Annual Income", "annual_inc", "num"
)
# if you wish to export and share the image:
# annual_inc_plot.savefig('fe_annual_inc.png')
(Line plot of partial dependence on annual_inc by Grade)
In [11]:

term_plot = create_fe_plot(fe_data, "term", "Loan Term", "term", "cat")
# if you wish to export and share the image:
# term_plot.savefig('fe_term_plot.png')
(Bar plot of partial dependence on term by Grade)
Get Started with Free Trial

Experience new features and capabilities previously only available in our full AI Platform product.

The post Multi-Model Analysis appeared first on DataRobot AI Platform.

]]>
Model Selection via Custom Metrics https://www.datarobot.com/ai-accelerators/model-selection-via-custom-metrics/ Thu, 22 Feb 2024 16:24:02 +0000 https://www.datarobot.com/?post_type=aiaccelerator&p=53663 This AI Accelerator demonstrates how one can leverage DataRobot's python client to extract predictions, compute custom metrics, and sort their DataRobot models accordingly.

The post Model Selection via Custom Metrics appeared first on DataRobot AI Platform.

]]>
Overview

When it comes to evaluating machine learning model performance, DataRobot provides many of the standard metrics out-of-the-box, either on the Leaderboard or as a model insight. However, depending on the industry, you may need to sort the Leaderboard by a specific metric that is not natively supported by DataRobot, or by return-on-investment (ROI). To help make this process easier, this notebook outlines a way to accomplish this by leveraging DataRobot's Python client for a supervised, binary classification problem.

This notebook outlines the following steps:

  1. Setup: import libraries and connect to DataRobot
  2. Build models with Autopilot
  3. Retrieve predictions and actuals
  4. Sort models by Brier Skill Score (BSS)
  5. Sort models by Rate@Top1%
  6. Sort models by ROI

Setup

First, import the necessary packages and set up the connection to the DataRobot platform.

Import libraries

In [1]:

from typing import Callable, List

from adjustText import adjust_text
import datarobot as dr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss

print(f"DataRobot version: {dr.__version__}")

DataRobot version: 3.0.2

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

In [2]:

DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2"
# The URL may vary depending on your hosting preference, the above example is for DataRobot Managed AI Cloud

DATAROBOT_API_TOKEN = "<INSERT YOUR DataRobot API Token>"
# The API Token can be found by click the avatar icon and then </> Developer Tools

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix="AIA-AE-CM-61",  # Optional but helps DataRobot improve this workflow
)

dr.client._global_client = client
Out [2]:

<datarobot.rest.RESTClientObject at 0x10ddda4c0>

Import data

Next, load the following dataset from an Anti-Money Laundering (AML) example into memory. In this dataset, the unit of analysis is an individual alert and the target is binary, indicating if the alert resulted in a Suspicious Activity Report (SAR). SARs take time and money for AML compliance teams to file, so being able to identify which alerts to focus (or not focus) on can result in millions of dollars saved per year.

In terms of bringing the dataset into DataRobot, you’ll use dr.Project.create(). While this isn’t the only way, having data already loaded into memory gives you the ability to easily match your predictions back to the actuals – a necessary step for computing metrics outside of DataRobot.

Define parameters

In [2]:

# Defining dataset and target
dataset_location = (
    "https://s3.amazonaws.com/datarobot-use-case-datasets/DR_Demo_AML_Alert_train.csv"
)
target = "SAR"

# Load dataset
df = pd.read_csv(dataset_location)
df

Out [3]:

(Preview of the first and last five rows of the 10,000-row alert dataset, with columns ALERT, SAR, kycRiskScore, income, tenureMonths, creditScore, state, nbrPurchases90d, avgTxnSize90d, totalSpend90d, indCustReqRefund90d, totalRefundsToCust90d, nbrPaymentsCashLike90d, maxRevolveLine, indOwnsHome, nbrInquiries1y, nbrCollections3y, nbrWebLogins90d, nbrPointRed90d, PEP)

Modeling

Create a project

In[4]:

project = dr.Project.create(sourcedata=df, project_name="Custom Metrics")
project
Out[4]:

Project(Custom Metrics)

Run Autopilot

After creating a project, you can begin building models using DataRobot’s default modeling mode. Note that this is just the default setting, and you can generate more models by specifying one of the other modeling modes.
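For instance, a sketch of requesting Quick Autopilot instead of the default mode (swap this in for the call in the next cell if you prefer a faster run) might look like:

# Hypothetical alternative: run Quick Autopilot instead of the default modeling mode
project.analyze_and_model(
    target=target,
    mode=dr.enums.AUTOPILOT_MODE.QUICK,
    worker_count=-1,
    advanced_options=dr.AdvancedOptions(prepare_model_for_deployment=False),
)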

In [5]:
# Run models
project.analyze_and_model(
    target=target,
    worker_count=-1,  # Setting the worker count to -1 will ensure that you use the max available workers for your account
    max_wait=600,
    advanced_options=dr.AdvancedOptions(
        prepare_model_for_deployment=False
    ),  # Will speed up modeling process
)

# Wait for them to complete
project.wait_for_autopilot()
In progress: 9, queued: 0 (waited: 0s)
In progress: 9, queued: 0 (waited: 0s)
In progress: 9, queued: 0 (waited: 1s)
In progress: 9, queued: 0 (waited: 2s)
In progress: 9, queued: 0 (waited: 3s)
In progress: 9, queued: 0 (waited: 5s)
In progress: 9, queued: 0 (waited: 9s)
In progress: 9, queued: 0 (waited: 16s)
In progress: 9, queued: 0 (waited: 29s)
In progress: 6, queued: 0 (waited: 49s)
In progress: 16, queued: 0 (waited: 70s)
In progress: 14, queued: 0 (waited: 90s)
In progress: 6, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 131s)
In progress: 0, queued: 0 (waited: 151s)

Retrieve models

In [6]:

# List trained models
models = project.get_models()
print(f"Number of models built for this dataset: {len(models)}")
models

Number of models built for this dataset: 9

Out[6]:
[Model('RandomForest Classifier (Gini)'),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('RuleFit Classifier'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Generalized Additive2 Model'),
 Model('Light Gradient Boosting on ElasticNet Predictions '),
 Model('Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)'),
 Model('Elastic-Net Classifier (L2 / Binomial Deviance)'),
 Model('Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)')]

Get predictions and actuals

In order to compute metrics manually, you need both the predictions and the actuals (i.e., the target’s actual values) associated with those predictions. To extract the former, use datarobot.models.Model.request_training_predictions(), and for the latter, simply join the actuals from the dataset you have in memory based on the row_id provided by DataRobot. Because the predictions for each model will be different, you can save these as new attributes on the model object (the helper functions below add them for you). Note that this is just one way to do this. If you pull the models again via datarobot.models.Project.get_models() and overwrite your list of models, you’ll need to re-add the desired information to the Model objects.

In [7]:

# Helper function to get the out-of-sample predictions for a given model


def request_training_predictions(models: List[dr.models.model.Model], data_subset: str):
    """
    Requests (and waits) out-of-sample predictions for a batch of models

    Parameters
    ----------
    models : List of DataRobot models
    data_subset: string indicating data subset (see datarobot.enums.DATA_SUBSET)

    """

    # Request predictions
    jobs = []
    for model in models:
        try:
            jobs.append(model.request_training_predictions(data_subset))

        except dr.errors.ClientError:
            # Predictions were already requested for this model
            pass

    # Wait
    jobs = [x.wait_for_completion(max_wait=60 * 60 * 24) for x in jobs]


def adds_oos_predictions_to_model(
    model: dr.models.model.Model,
    target_series: pd.Series,
    data_subset: str,
):
    """
    Adds the out-of-sample predictions (and relevant information) for a data subset from a model to that model as attributes
    (assumes this is a binary project type)

    Parameters
    ----------
    model : DataRobot model
    target_series: Binary target values for the dataset with the respective row ID as the index
    data_subset: string indicating data subset (see datarobot.enums.DATA_SUBSET)

    """

    # Get project object
    project = dr.Project.get(model.project_id)

    # Asserting target type is met
    assert (
        project.target_type == "Binary"
    ), "This function expects a binary classification project type!"

    # Request or gather predictions
    try:
        training_predict_job = model.request_training_predictions(data_subset)
        training_predictions = training_predict_job.get_result_when_complete(
            max_wait=60 * 60 * 24
        )

    except dr.errors.ClientError:
        training_predictions = [
            tp
            for tp in dr.TrainingPredictions.list(project.id)
            if tp.model_id == model.id and tp.data_subset == data_subset
        ][0]

    # Get as dataframe
    preds = training_predictions.get_all_as_dataframe()

    # Gather predictions and join actuals
    preds = preds.set_index("row_id")
    preds = preds.join(target_series)

    # Define positive class (True / False is stored as float)
    if isinstance(project.positive_class, str):
        positive_class = project.positive_class

    else:
        positive_class = float(project.positive_class)

    # Save information
    model.__y_prob = preds[f"class_{positive_class}"].values
    model.__y_true = preds[target].values
    model.__partition_id = preds["partition_id"].values
    model.__row_id = preds.index
In [8]:

# Unlock the holdout to be used for analysis
project.unlock_holdout()

# Repull models so that holdout metric values are filled in
models = project.get_models()

# Specify the data partition
data_partition = dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT

# Request the out-of-sample predictions
request_training_predictions(models=models, data_subset=data_partition)
In [9]:

# Add this information to your list of models as attributes
for model in models:
    adds_oos_predictions_to_model(
        model=model,
        target_series=df[target],
        data_subset=data_partition,
    )

# Check one of them
print(f"Probabilities: {models[0].__y_prob}")
print(f"Target: {models[0].__y_true}")
print(f"Partition: {models[0].__partition_id}")
print(f"Row: {models[0].__row_id}")
Probabilities: [0.250285   0.21784128 0.         ... 0.0024568  0.00067236 0.        ]
Target: [0 0 0 ... 0 0 0]
Partition: ['Holdout' 'Holdout' 'Holdout' ... 'Holdout' 'Holdout' 'Holdout']
Row: Int64Index([   1,    2,    5,    6,    9,   10,   12,   17,   19,   22,
            ...
            9976, 9985, 9986, 9988, 9989, 9990, 9991, 9992, 9994, 9995],
           dtype='int64', name='row_id', length=3600)

Sort models by Brier Skill Score (BSS)

Now that you have the necessary data, you can begin evaluating the DataRobot models using whatever metrics you’d like. Below is an example using a metric not given to you out-of-the-box by DataRobot, Brier Skill Score, with the help of sklearn’s brier_score_loss() function. Brier Score is similar to Log Loss but is strictly bounded between [0,1] (Log Loss is bounded between [0,+∞)).

Brier Skill Score is an extension of the Brier Score where you compare the Brier Score of a candidate model to a reference model’s to understand how much better (or worse) a model is relatively. For defining the reference model, a common practice is to use the positive class event rate from the dataset you’re evaluating.
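In equation form, where BS_candidate and BS_reference are the Brier Scores of the candidate and reference models on the same data:

BSS = 1 - BS_candidate / BS_reference

A BSS of 0 means no improvement over the reference model, values closer to 1 indicate a larger improvement, and negative values mean the candidate is worse than the reference.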

In [10]:

# Define the BSS function


def brier_skill_score(
    y_true: np.array, y_prob_candidate: np.array, y_prob_reference: np.array, **kwargs
) -> float:
    """
    Computes Brier Skill Score (the larger the better)

    Parameters
    ----------
    y_true: true labels
    y_prob_candidate: probability predictions from a candidate model
    y_prob_reference: probability predictions from a reference model
    **kwargs: additional arguments to pass to `brier_score_loss()`

    Returns
    -------
    Brier Skill Score value

    References
    -------
    https://en.wikipedia.org/wiki/Brier_score#Brier_Skill_Score_(BSS)

    """

    # Compute Brier Scores
    bs_candidate = brier_score_loss(y_true=y_true, y_prob=y_prob_candidate, **kwargs)
    bs_reference = brier_score_loss(y_true=y_true, y_prob=y_prob_reference, **kwargs)

    return 1 - bs_candidate / bs_reference
In [11]:

# Reference model predictions (the event rate propagated forward)
baseline = np.mean(models[0].__y_true == project.positive_class)
baseline
Out [11]:

0.10277777777777777
In [12]:

# Example
brier_skill_score(
    y_true=models[0].__y_true,
    y_prob_candidate=models[0].__y_prob,
    y_prob_reference=np.repeat(baseline, len(models[0].__y_true)),
    pos_label=project.positive_class,
)
Out [12]:

0.43302782027907316

In [13]:

# An example of how a model stores its performance metric information
models[0].metrics
Out [13]:

{'AUC': {'validation': 0.94542,
  'crossValidation': 0.9446899999999999,
  'holdout': 0.95553,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Area Under PR Curve': {'validation': 0.59093,
  'crossValidation': 0.606878,
  'holdout': 0.67187,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'FVE Binomial': {'validation': 0.51133,
  'crossValidation': 0.5136839999999999,
  'holdout': 0.54902,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Gini Norm': {'validation': 0.89084,
  'crossValidation': 0.8893800000000001,
  'holdout': 0.91106,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Kolmogorov-Smirnov': {'validation': 0.83426,
  'crossValidation': 0.8329359999999999,
  'holdout': 0.83389,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'LogLoss': {'validation': 0.16153,
  'crossValidation': 0.16074799999999997,
  'holdout': 0.14956,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Max MCC': {'validation': 0.58339,
  'crossValidation': 0.586418,
  'holdout': 0.60468,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'RMSE': {'validation': 0.23521,
  'crossValidation': 0.234088,
  'holdout': 0.22327,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Rate@Top10%': {'validation': 0.575,
  'crossValidation': 0.56125,
  'holdout': 0.64,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Rate@Top5%': {'validation': 0.6625,
  'crossValidation': 0.72,
  'holdout': 0.81,
  'training': None,
  'backtestingScores': None,
  'backtesting': None},
 'Rate@TopTenth%': {'validation': 0.5,
  'crossValidation': 0.7,
  'holdout': 1.0,
  'training': None,
  'backtestingScores': None,
  'backtesting': None}}
In [14]:

# Save the BSS score for each model in the same way
for model in models:
    # Compute for each partition
    valid_score = brier_skill_score(
        y_true=model.__y_true[model.__partition_id == "0.0"],
        y_prob_candidate=model.__y_prob[model.__partition_id == "0.0"],
        y_prob_reference=np.repeat(
            baseline, len(model.__y_true[model.__partition_id == "0.0"])
        ),
        pos_label=project.positive_class,
    )
    holdout_score = brier_skill_score(
        y_true=model.__y_true[model.__partition_id == "Holdout"],
        y_prob_candidate=model.__y_prob[model.__partition_id == "Holdout"],
        y_prob_reference=np.repeat(
            baseline, len(model.__y_true[model.__partition_id == "Holdout"])
        ),
        pos_label=project.positive_class,
    )

    # Create new metrics entry (with the same format)
    model.metrics["BSS"] = {
        "validation": valid_score,
        "crossValidation": None,
        "holdout": holdout_score,
        "training": None,
        "backtestingScores": None,
        "backtesting": None,
    }

    print(
        f"{model.model_type}: {round(model.metrics['BSS']['validation'], 4)}, {round(model.metrics['BSS']['holdout'], 4)}"
    )
RandomForest Classifier (Gini): 0.3986, 0.4604
eXtreme Gradient Boosted Trees Classifier with Early Stopping: 0.3985, 0.4692
RuleFit Classifier: 0.3623, 0.4383
Light Gradient Boosted Trees Classifier with Early Stopping: 0.3795, 0.4557
Generalized Additive2 Model: 0.3707, 0.4342
Light Gradient Boosting on ElasticNet Predictions : 0.3494, 0.4357
Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units): 0.3484, 0.3853
Elastic-Net Classifier (L2 / Binomial Deviance): 0.293, 0.2534
Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance): 0.2904, 0.259
In [15]:

# Can inspect BSS alongside DataRobot's metrics
pd.DataFrame(model.metrics).T.dropna(axis=1)

Out[15]:

 | validation | holdout
AUC | 0.92564 | 0.922840
Area Under PR Curve | 0.54531 | 0.513740
FVE Binomial | 0.35798 | 0.340810
Gini Norm | 0.85128 | 0.845680
Kolmogorov-Smirnov | 0.77206 | 0.772760
LogLoss | 0.21222 | 0.218600
Max MCC | 0.54830 | 0.538350
RMSE | 0.25550 | 0.261650
Rate@Top10% | 0.493750 | 0.485000
Rate@Top5% | 0.637500 | 0.540000
Rate@TopTenth% | 1.000000 | 1.000000
BSS | 0.29041 | 0.259016
In [16]:

# Sort the Leaderboard
# Note that you sort by the validation partition only to preserve the integrity of the holdout data
models_sorted_by_bss = sorted(
    models, key=lambda x: x.metrics["BSS"]["validation"], reverse=True
)
models_sorted_by_bss
Out [16]:

[Model('RandomForest Classifier (Gini)'),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Generalized Additive2 Model'),
 Model('RuleFit Classifier'),
 Model('Light Gradient Boosting on ElasticNet Predictions '),
 Model('Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)'),
 Model('Elastic-Net Classifier (L2 / Binomial Deviance)'),
 Model('Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)')]

Sort models by Rate@Top1%

Now that you’ve seen an example of a metric not included in DataRobot, let’s look at another example where you take an existing metric and expand on it. DataRobot’s Rate@Top10%, Rate@Top5%, and Rate@TopTenth% metrics are extremely useful for understanding how many events a model is capturing (in this case SAR alerts) in the highest predictions. Oftentimes, 10%, 5%, and 0.1% are sufficient, but some use cases may require computing different percentages. To this end, below is a function to compute this statistic for any percentage.

In [17]:

# Define Rate@TopX% function


def rate_at_top_x(
    y_true: np.array, y_prob: np.array, percentage: float, positive_class: str
) -> float:
    """
    Computes DataRobot's Rate@TopX% metric for any percentage

    Parameters
    ----------
    y_true: true labels
    y_prob: probability predictions from a model
    percentage: percentage to use for the rate metric
    positive_class: event class associated with model

    Returns
    -------
    Rate@TopX% value

    References
    -------
    https://app.datarobot.com/docs/modeling/reference/model-detail/opt-metric.html#ratetop10-ratetop5-ratetoptenth

    """

    # Ensure percentage is in bounds
    assert 0 <= percentage <= 100, "Percentage needs to be [0, 100]"

    # Make boolean
    actuals_mask = y_true == positive_class

    # Find top X predictions
    top_preds_mask = y_prob >= np.percentile(y_prob, 100 - percentage)

    # To avoid dividing by 0
    if top_preds_mask.sum() == 0:
        return 0.0

    # Compute rate
    return (actuals_mask * top_preds_mask).sum() / top_preds_mask.sum()
In [18]:

# Verify by comparing your metric back to DataRobot on the validation partition
for i in [10, 5, 0.1]:
    # Compute for each partition
    valid_value = rate_at_top_x(
        y_true=models[0].__y_true[models[0].__partition_id == "0.0"],
        y_prob=models[0].__y_prob[models[0].__partition_id == "0.0"],
        percentage=i,
        positive_class=project.positive_class,
    )
    holdout_value = rate_at_top_x(
        y_true=models[0].__y_true[models[0].__partition_id == "Holdout"],
        y_prob=models[0].__y_prob[models[0].__partition_id == "Holdout"],
        percentage=i,
        positive_class=project.positive_class,
    )

    # DataRobot's value
    if i == 0.1:
        dr_valid_value = models[0].metrics["Rate@TopTenth%"]["validation"]
        dr_holdout_value = models[0].metrics["Rate@TopTenth%"]["holdout"]

    else:
        dr_valid_value = models[0].metrics[f"Rate@Top{i}%"]["validation"]
        dr_holdout_value = models[0].metrics[f"Rate@Top{i}%"]["holdout"]

    print(
        f"Computed Rate@Top{i}%: {round(valid_value, 4)}, {round(holdout_value, 4)} | DataRobot's: {dr_valid_value}, {dr_holdout_value}"
    )
Computed Rate@Top10%: 0.575, 0.64 | DataRobot's: 0.575, 0.64
Computed Rate@Top5%: 0.6625, 0.81 | DataRobot's: 0.6625, 0.81
Computed Rate@Top0.1%: 0.5, 1.0 | DataRobot's: 0.5, 1.0
In [19]:

# Compute and save the Rate@Top1%
for model in models:
    # Compute for each partition
    valid_score = rate_at_top_x(
        y_true=model.__y_true[model.__partition_id == "0.0"],
        y_prob=model.__y_prob[model.__partition_id == "0.0"],
        percentage=1,
        positive_class=project.positive_class,
    )
    holdout_score = rate_at_top_x(
        y_true=model.__y_true[model.__partition_id == "Holdout"],
        y_prob=model.__y_prob[model.__partition_id == "Holdout"],
        percentage=1,
        positive_class=project.positive_class,
    )

    # Create new metrics entry (with same format)
    model.metrics["Rate@Top1%"] = {
        "validation": valid_score,
        "crossValidation": None,
        "holdout": holdout_score,
        "training": None,
        "backtestingScores": None,
        "backtesting": None,
    }

    print(
        f"{model.model_type}: {round(model.metrics['Rate@Top1%']['validation'], 4)}, {round(model.metrics['Rate@Top1%']['holdout'], 4)}"
    )
RandomForest Classifier (Gini): 0.75, 0.9
eXtreme Gradient Boosted Trees Classifier with Early Stopping: 0.8125, 0.7
RuleFit Classifier: 0.75, 0.8
Light Gradient Boosted Trees Classifier with Early Stopping: 0.75, 0.75
Generalized Additive2 Model: 0.8125, 0.75
Light Gradient Boosting on ElasticNet Predictions : 0.8125, 0.75
Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units): 0.8125, 0.85
Elastic-Net Classifier (L2 / Binomial Deviance): 0.75, 0.8
Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance): 0.75, 0.75
In [20]:

# Sort the leaderboard
models_sorted_by_rate_at_top1 = sorted(
    models, key=lambda x: x.metrics["Rate@Top1%"]["validation"], reverse=True
)
models_sorted_by_rate_at_top1
Out [20]:

[Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Generalized Additive2 Model'),
 Model('Light Gradient Boosting on ElasticNet Predictions '),
 Model('Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)'),
 Model('RandomForest Classifier (Gini)'),
 Model('RuleFit Classifier'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Elastic-Net Classifier (L2 / Binomial Deviance)'),
 Model('Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)')]

Sort models by ROI

While machine learning metrics like Brier Skill Score and Rate@TopX% are useful for understanding empirical prediction performance, they don’t map easily to business value. Although sometimes difficult, estimating the ROI of using machine learning can be vital for use case adoption and model implementation. Here, you will repeat a similar exercise as above, but compute a dollar figure rather than a machine learning metric. Since this is a binary classification problem, four possible outcomes exist:

  • True positive (TP): The model correctly predicted that the alert would result in a SAR.
  • True negative (TN): The model correctly predicted that the alert would not result in a SAR.
  • False positive (FP): The model incorrectly predicted that the alert would result in a SAR.
  • False negative (FN): The model incorrectly predicted that the alert would not result in a SAR.

For the sake of example, you can use the same cost estimates described here, which are:

  • Cost of investigating an alert: -$50
  • Cost of remediating a SAR that was not detected: -$200

Given that each alert costs -$50 to review and a financial institution explores each alert, one way to compute ROI is in terms of savings from using a model-based approach:

ROI = cost_with_model – cost_without_model

where

cost_with_model = -($50 * number of alerts flagged + $200 * number of SAR alerts missed) = -(50 * (TPs + FPs) + 200 * FNs)

cost_without_model = -$50 * all alerts

Note that in order to assign ROI to each model, you must first establish a threshold to apply to the predicted probabilities (so that a model can determine if an alert is SAR-worthy or not). Finding this threshold generally involves optimizing some sort of metric (e.g., F1-score). Here, you’ll use the validation partition to find the threshold that maximizes ROI and then use that threshold to determine the ROI on the holdout partition. Each of these values will be recorded as done previously under the “metrics” attribute within the model object.
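As a minimal sketch of this arithmetic, using hypothetical confusion-matrix counts (not taken from this project):

# Hypothetical confusion-matrix counts at one threshold, for 1,000 alerts in total
tp, fp, fn, tn = 80, 150, 5, 765
n_alerts = tp + fp + fn + tn

cost_with_model = -(50 * (tp + fp) + 200 * fn)  # review flagged alerts + remediate missed SARs
cost_without_model = -50 * n_alerts             # review every alert
roi = cost_with_model - cost_without_model      # savings from using the model

print(cost_with_model, cost_without_model, roi)  # -12500 -50000 37500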

In [21]:

# Helpers for finding the best threshold according to a supplied payoff matrix


def compute_total_profit(
    payoff_matrix: dr.models.PayoffMatrix,
    true_positive_count: int,
    true_negative_count: int,
    false_positive_count: int,
    false_negative_count: int,
) -> float:
    """
    Computes a value representing total profit

    Parameters
    ----------
    payoff_matrix: a DataRobot payoff matrix
    true_positive_count: number of true positives
    true_negative_count: number of true negatives
    false_positive_count: number of false positives
    false_negative_count: number of false negatives

    Returns
    -------
    A value representing the total profit

    """

    # Compute values
    tp_total = payoff_matrix.true_positive_value * true_positive_count
    tn_total = payoff_matrix.true_negative_value * true_negative_count
    fp_total = payoff_matrix.false_positive_value * false_positive_count
    fn_total = payoff_matrix.false_negative_value * false_negative_count

    return tp_total + tn_total + fp_total + fn_total


def optimize_thresholds_by_total_profit(
    model: dr.models.model.Model,
    profit_function: Callable,
    payoff_matrix: dr.models.PayoffMatrix,
    data_source: str,
) -> tuple:
    """
    Find the threshold that maximizes the value returned by the profit function in a data partition
 
    Parameters
    ----------
    model : DataRobot model
    profit_function: function to compute profit with (should return a float)
    payoff_matrix: a DataRobot payoff matrix
    data_source: string indicating data source (see datarobot.enums.CHART_DATA_SOURCE)

    Returns
    -------
    Threshold that maximizes total profit and the respective total profit value

    """

    # Leveraging the pre-computed thresholds from DataRobot
    thresholds = model.get_roc_curve(source=data_source).roc_points

    # Cycle through each threshold
    results = {}
    for i in range(len(thresholds)):
        # Assign counts
        true_positive_count = thresholds[i]["true_positive_score"]
        true_negative_count = thresholds[i]["true_negative_score"]
        false_positive_count = thresholds[i]["false_positive_score"]
        false_negative_count = thresholds[i]["false_negative_score"]

        # Pass the confusion matrix counts to the ROI function
        profit = profit_function(
            payoff_matrix=payoff_matrix,
            true_positive_count=true_positive_count,
            true_negative_count=true_negative_count,
            false_positive_count=false_positive_count,
            false_negative_count=false_negative_count,
        )

        # Save results
        results[thresholds[i]["threshold"]] = profit

    # Find threshold with maximum profit
    best_threshold = max(results, key=results.get)

    # Return threshold and profit value
    return best_threshold, results[best_threshold]
In [22]:

# Create payoff matrix (note that negative values represent costs)
# This can be viewed in the UI
payoff_matrix = dr.models.PayoffMatrix.create(
    project_id=project.id,
    name="AML Costs",
    true_positive_value=-50,
    true_negative_value=0,
    false_positive_value=-50,
    false_negative_value=-200,
)

Use the snippet below to run through each model, find the threshold that maximizes total profit on the validation partition, apply it to the holdout partition, and save the results.

In [23]:

for model in models:
    # Compute the threshold to use
    threshold, valid_profit = optimize_thresholds_by_total_profit(
        model=model,
        profit_function=compute_total_profit,
        payoff_matrix=payoff_matrix,
        data_source=dr.enums.CHART_DATA_SOURCE.VALIDATION,
    )

    # Apply threshold to holdout
    holdout_info = model.get_roc_curve(
        source=dr.enums.CHART_DATA_SOURCE.HOLDOUT
    ).estimate_threshold(threshold)

    # Compute payoff on holdout
    holdout_profit = compute_total_profit(
        payoff_matrix=payoff_matrix,
        true_positive_count=holdout_info["true_positive_score"],
        true_negative_count=holdout_info["true_negative_score"],
        false_positive_count=holdout_info["false_positive_score"],
        false_negative_count=holdout_info["false_negative_score"],
    )

    # Create new metrics entry (with same format)
    model.metrics["Total Profit"] = {
        "validation": valid_profit,
        "crossValidation": None,
        "holdout": holdout_profit,
        "training": None,
        "backtestingScores": None,
        "backtesting": None,
    }

    # Save threshold to the model
    model.__best_threshold_by_total_profit = threshold

    print(
        f"{model.model_type} (threshold = {round(model.__best_threshold_by_total_profit, 2)}): {model.metrics['Total Profit']['validation']}, {model.metrics['Total Profit']['holdout']}"
    )
RandomForest Classifier (Gini) (threshold = 0.22): -19850.0, -24350.0
eXtreme Gradient Boosted Trees Classifier with Early Stopping (threshold = 0.18): -19350.0, -24400.0
RuleFit Classifier (threshold = 0.14): -19750.0, -24950.0
Light Gradient Boosted Trees Classifier with Early Stopping (threshold = 0.17): -19900.0, -24350.0
Generalized Additive2 Model (threshold = 0.12): -20350.0, -25600.0
Light Gradient Boosting on ElasticNet Predictions  (threshold = 0.11): -20450.0, -25600.0
Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units) (threshold = 0.23): -21150.0, -26800.0
Elastic-Net Classifier (L2 / Binomial Deviance) (threshold = 0.16): -21650.0, -29000.0
Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) (threshold = 0.16): -21550.0, -29000.0
In [24]:

# With the cost of using each model computed, you can now compute the ROI for this use case
# (partition assignments are identical across models, so use the first model's stored values)
n_alerts_valid = sum(models[0].__partition_id == "0.0")
n_alerts_holdout = sum(models[0].__partition_id == "Holdout")
cost_without_model_valid = -50 * n_alerts_valid
cost_without_model_holdout = -50 * n_alerts_holdout

# Now iterate through to compute the final ROI value
for model in models:
    # Cost with model
    cost_with_model_valid = model.metrics["Total Profit"]["validation"]
    cost_with_model_holdout = model.metrics["Total Profit"]["holdout"]

    # Compute savings (aka ROI)
    valid_savings = cost_with_model_valid - cost_without_model_valid
    holdout_savings = cost_with_model_holdout - cost_without_model_holdout

    # Create new metrics entry (with same format)
    model.metrics["ROI"] = {
        "validation": valid_savings,
        "crossValidation": None,
        "holdout": holdout_savings,
        "training": None,
        "backtestingScores": None,
        "backtesting": None,
    }

    print(
        f"{model.model_type} (threshold = {round(model.__best_threshold_by_total_profit, 2)}): {model.metrics['ROI']['validation']}, {model.metrics['ROI']['holdout']}"
    )
RandomForest Classifier (Gini) (threshold = 0.22): 60150.0, 75650.0
eXtreme Gradient Boosted Trees Classifier with Early Stopping (threshold = 0.18): 60650.0, 75600.0
RuleFit Classifier (threshold = 0.14): 60250.0, 75050.0
Light Gradient Boosted Trees Classifier with Early Stopping (threshold = 0.17): 60100.0, 75650.0
Generalized Additive2 Model (threshold = 0.12): 59650.0, 74400.0
Light Gradient Boosting on ElasticNet Predictions  (threshold = 0.11): 59550.0, 74400.0
Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units) (threshold = 0.23): 58850.0, 73200.0
Elastic-Net Classifier (L2 / Binomial Deviance) (threshold = 0.16): 58350.0, 71000.0
Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) (threshold = 0.16): 58450.0, 71000.0
In [25]:

# Now sort the Leaderboard
models_sorted_by_roi = sorted(
    models, key=lambda x: x.metrics["ROI"]["validation"], reverse=True
)
models_sorted_by_roi
Out [25]:

[Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('RuleFit Classifier'),
 Model('RandomForest Classifier (Gini)'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Generalized Additive2 Model'),
 Model('Light Gradient Boosting on ElasticNet Predictions '),
 Model('Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)'),
 Model('Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)'),
 Model('Elastic-Net Classifier (L2 / Binomial Deviance)')]

Conclusion

In this notebook, you explored a way to evaluate DataRobot models using custom metrics. This example demonstrates the flexibility and creativity you can bring to model selection, whether that means ranking DataRobot models by a stakeholder metric of interest or by estimated ROI. Below are the Leaderboard rankings for each of the computed metrics, along with some supporting visuals.

In [26]:

# Compare the various sorting options:
pd.DataFrame(
    {
        f"model_rank_{project.metric}": [x.model_type for x in models],
        "model_rank_bss": [x.model_type for x in models_sorted_by_bss],
        "model_rank_rate_at_top1": [
            x.model_type for x in models_sorted_by_rate_at_top1
        ],
        "model_rank_roi": [x.model_type for x in models_sorted_by_roi],
    }
)

Out [26]:

|   | model_rank_LogLoss | model_rank_bss | model_rank_rate_at_top1 | model_rank_roi |
| 0 | RandomForest Classifier (Gini) | RandomForest Classifier (Gini) | eXtreme Gradient Boosted Trees Classifier with… | eXtreme Gradient Boosted Trees Classifier with… |
| 1 | eXtreme Gradient Boosted Trees Classifier with… | eXtreme Gradient Boosted Trees Classifier with… | Generalized Additive2 Model | RuleFit Classifier |
| 2 | RuleFit Classifier | Light Gradient Boosted Trees Classifier with E… | Light Gradient Boosting on ElasticNet Predicti… | RandomForest Classifier (Gini) |
| 3 | Light Gradient Boosted Trees Classifier with E… | Generalized Additive2 Model | Keras Slim Residual Neural Network Classifier … | Light Gradient Boosted Trees Classifier with E… |
| 4 | Generalized Additive2 Model | RuleFit Classifier | RandomForest Classifier (Gini) | Generalized Additive2 Model |
| 5 | Light Gradient Boosting on ElasticNet Predicti… | Light Gradient Boosting on ElasticNet Predicti… | RuleFit Classifier | Light Gradient Boosting on ElasticNet Predicti… |
| 6 | Keras Slim Residual Neural Network Classifier … | Keras Slim Residual Neural Network Classifier … | Light Gradient Boosted Trees Classifier with E… | Keras Slim Residual Neural Network Classifier … |
| 7 | Elastic-Net Classifier (L2 / Binomial Deviance) | Elastic-Net Classifier (L2 / Binomial Deviance) | Elastic-Net Classifier (L2 / Binomial Deviance) | Elastic-Net Classifier (mixing alpha=0.5 / Bin… |
| 8 | Elastic-Net Classifier (mixing alpha=0.5 / Bin… | Elastic-Net Classifier (mixing alpha=0.5 / Bin… | Elastic-Net Classifier (mixing alpha=0.5 / Bin… | Elastic-Net Classifier (L2 / Binomial Deviance) |
In [27]:

# Can inspect all new metrics alongside DataRobot's metrics
pd.DataFrame(model.metrics).T.dropna(axis=1)

Out[27]:

| Metric              | validation | holdout  |
| AUC                 | 0.92564    | 0.92284  |
| Area Under PR Curve | 0.54531    | 0.51374  |
| FVE Binomial        | 0.35798    | 0.34081  |
| Gini Norm           | 0.85128    | 0.84568  |
| Kolmogorov-Smirnov  | 0.77206    | 0.77276  |
| LogLoss             | 0.21222    | 0.21860  |
| Max MCC             | 0.54830    | 0.53835  |
| RMSE                | 0.25550    | 0.26165  |
| Rate@Top10%         | 0.49375    | 0.48500  |
| Rate@Top5%          | 0.63750    | 0.54000  |
| Rate@TopTenth%      | 1.00000    | 1.00000  |
| BSS                 | 0.29041    | 0.259016 |
| Rate@Top1%          | 0.75000    | 0.75000  |
| Total Profit        | -21550.0   | -29000.0 |
| ROI                 | 58450.0    | 71000.0  |
In [28]:
# Add some visuals around these results


def prep_metric_data_for_plotting(
    models: List[dr.models.model.Model], data_source: str
) -> pd.DataFrame:
    """
    Organizing the metric data into a dataframe

    Parameters
    ----------
    models : List of DataRobot models
    data_source: string indicating data source (see datarobot.enums.CHART_DATA_SOURCE)

    Returns
    -------
    Dataframe of metrics with model info in the index

    """

    # To save results
    df_metrics = pd.DataFrame()

    # Cycle through each model and save results
    for model in models:
        # Make into dataframe
        tmp_df = pd.DataFrame(model.metrics).assign(
            model_id=model.id, model_type=model.model_type
        )

        # Subset to requested data source
        tmp_df = tmp_df.loc[tmp_df.index.isin([data_source])]

        # Make model ID index
        tmp_df = tmp_df.set_index(["model_id", "model_type"])

        # Append
        df_metrics = pd.concat([df_metrics, tmp_df])

    return df_metrics


def scatter_plot(
    df: pd.DataFrame,
    x_axis_metric: str,
    y_axis_metric: str,
    label_as_model_type: bool,
    **kwargs,
):
    """
    Scatter plot of two metrics with model info annotated on the points
    
    Parameters
    ----------
    df: output from the function "prep_metric_data_for_plotting()"
    x_axis_metric: name of metric to plot on x-axis
    y_axis_metric: name of metric to plot on y-axis
    label_as_model_type: whether to use the model_type as the label or the model ID
    **kwargs: additional arguments to pass to plotting function

    """

    # Make plot
    df.plot.scatter(x_axis_metric, y_axis_metric, **kwargs)

    # If true, use model type as label
    # Else use model ID
    if label_as_model_type:
        labels = df.index.droplevel(-2)

    else:
        labels = df.index.droplevel(-1)

    # Annotate each data point
    x = df[x_axis_metric].values
    y = df[y_axis_metric].values
    ts = []
    for i, txt in enumerate(labels):
        ts.append(plt.text(x[i], y[i], txt))

    # Space text labels out
    adjust_text(ts, x=x, y=y)

    plt.show()
In [29]:

# Create data for plotting
plot_data = prep_metric_data_for_plotting(
    models=models, data_source=dr.enums.CHART_DATA_SOURCE.VALIDATION
)
plot_data

Out [29]:

| model_type | model_id | AUC | Area Under PR Curve | FVE Binomial | Gini Norm | Kolmogorov-Smirnov | LogLoss | Max MCC | RMSE | Rate@Top10% | Rate@Top5% | Rate@TopTenth% | BSS | Rate@Top1% | Total Profit | ROI |
| RandomForest Classifier (Gini) | 63e157785d0bfd338104ba5a | 0.94542 | 0.59093 | 0.51133 | 0.89084 | 0.83426 | 0.16153 | 0.58339 | 0.23521 | 0.575 | 0.6625 | 0.5 | 0.398631 | 0.75 | -19850 | 60150 |
| eXtreme Gradient Boosted Trees Classifier with Early Stopping | 63e157785d0bfd338104ba5c | 0.94675 | 0.59994 | 0.499 | 0.8935 | 0.84192 | 0.1656 | 0.59765 | 0.23524 | 0.55 | 0.6625 | 0.5 | 0.398484 | 0.8125 | -19350 | 60650 |
| RuleFit Classifier | 63e157785d0bfd338104ba5b | 0.94047 | 0.58264 | 0.4867 | 0.88094 | 0.83426 | 0.16967 | 0.58647 | 0.24221 | 0.55 | 0.6875 | 0.5 | 0.362299 | 0.75 | -19750 | 60250 |
| Light Gradient Boosted Trees Classifier with Early Stopping | 63e157785d0bfd338104ba58 | 0.94252 | 0.58087 | 0.48582 | 0.88504 | 0.83374 | 0.16996 | 0.58868 | 0.23891 | 0.55 | 0.6625 | 0.5 | 0.379528 | 0.75 | -19900 | 60100 |
| Generalized Additive2 Model | 63e157785d0bfd338104ba57 | 0.94185 | 0.58339 | 0.47659 | 0.8837 | 0.8273 | 0.17301 | 0.57606 | 0.24061 | 0.54375 | 0.7125 | 0 | 0.370673 | 0.8125 | -20350 | 59650 |
| Light Gradient Boosting on ElasticNet Predictions | 63e157785d0bfd338104ba59 | 0.93912 | 0.59906 | 0.46032 | 0.87824 | 0.81736 | 0.17839 | 0.57852 | 0.24465 | 0.53125 | 0.6625 | 1 | 0.349385 | 0.8125 | -20450 | 59550 |
| Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units) | 63e157785d0bfd338104ba54 | 0.93578 | 0.57099 | 0.41071 | 0.87156 | 0.77174 | 0.19478 | 0.55373 | 0.24484 | 0.54375 | 0.625 | 1 | 0.348352 | 0.8125 | -21150 | 58850 |
| Elastic-Net Classifier (L2 / Binomial Deviance) | 63e157785d0bfd338104ba55 | 0.92361 | 0.5425 | 0.36268 | 0.84722 | 0.74037 | 0.21066 | 0.53528 | 0.25503 | 0.5125 | 0.5875 | 1 | 0.292966 | 0.75 | -21650 | 58350 |
| Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) | 63e157785d0bfd338104ba56 | 0.92564 | 0.54531 | 0.35798 | 0.85128 | 0.77206 | 0.21222 | 0.5483 | 0.2555 | 0.49375 | 0.6375 | 1 | 0.29041 | 0.75 | -21550 | 58450 |
In [30]:

# Make some plots
metrics_to_plot = pd.DataFrame(
    {
        "x_axis_metrics": ["LogLoss", "BSS", "Rate@Top1%", "LogLoss"],
        "y_axis_metrics": ["AUC", "FVE Binomial", "Rate@Top10%", "ROI"],
        "label_as_model_type": [True, False, False, True],
    }
)

# Cycle through each combination
plt.style.use("ggplot")
for i in range(metrics_to_plot.shape[0]):
    # Set metrics
    x_axis_metric = metrics_to_plot["x_axis_metrics"].iloc[i]
    y_axis_metric = metrics_to_plot["y_axis_metrics"].iloc[i]
    label_as_model_type = metrics_to_plot["label_as_model_type"].iloc[i]

    # Make plot
    scatter_plot(
        df=plot_data,
        x_axis_metric=x_axis_metric,
        y_axis_metric=y_axis_metric,
        label_as_model_type=label_as_model_type,
        figsize=(15, 5),
        color="green",
        marker="D",
        s=25,
        title=f"Scatter plot of {x_axis_metric} vs. {y_axis_metric}",
        xlabel=x_axis_metric,
        ylabel=y_axis_metric,
    )
scatter plot 1
scatter plot 2
scatter plot 3
scatter plot 4

The post Model Selection via Custom Metrics appeared first on DataRobot AI Platform.

]]>
Model Factory with Python Multithreading https://www.datarobot.com/ai-accelerators/model-factory-with-python-multithreading/ Thu, 22 Feb 2024 14:20:59 +0000 https://www.datarobot.com/?post_type=aiaccelerator&p=53659 This accelerator shows a simple example of how to use the Python threading library to build a model factory.

The post Model Factory with Python Multithreading appeared first on DataRobot AI Platform.

]]>
Because of the Global Interpreter Lock (GIL), only one thread can execute Python code at a time in CPython (even though certain performance-oriented libraries can work around this limitation). Despite this limitation, multithreading is still an appropriate approach for running multiple I/O-bound tasks, such as API calls, simultaneously.

The DataRobot platform makes it possible to create model factories. A model factory is a system or set of procedures that automatically generates predictive models with little to no human intervention. More details can be found here. Third-party frameworks from the Python ecosystem, such as Dask, can also be used to build model factories. However, one of Dask’s own best practices is not to use distributed parallelism when it isn’t really needed, especially for smaller amounts of data.

Model factories improve the throughput of a DataRobot cluster during the training phase by making better use of the available modeling workers. This decreases model training time and increases the efficiency of data science teams that need to train tens or hundreds of different models. Compared to sequential project training, the performance gain typically reaches 2-3x.

This accelerator shows a simple example of how to use the Python threading library to build a model factory.

Setup

Import dependencies

In [1]:

import concurrent.futures as f
import datetime

import datarobot as dr
from datarobot import AUTOPILOT_MODE

print(dr.__version__)

3.1.1

Set the number of pool workers and the model target.

In [2]:

THREAD_POOL_WORKERS = 5
TARGET = "SalePrice"

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

In [3]:
dr.Client(config_path="drconfig.yaml")
Out [3]:
<datarobot.rest.RESTClientObject at 0x103479c90>

Create a dataset in the AI Catalog

In [4]:
training_dataset_file_path = "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/house_train_dataset.csv"
training_dataset = dr.Dataset.create_from_url(training_dataset_file_path)

Create a DataRobot project

In [5]:

project = dr.Project.create_from_dataset(
    training_dataset.id, project_name="Sequential Project"
)

Modeling

Start Autopilot for one project

In [6]:

project.analyze_and_model(target=TARGET, worker_count=-1, mode=AUTOPILOT_MODE.QUICK)
Out [6]:

Project(Sequential Project)
In [7]:

project.wait_for_autopilot(check_interval=60)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 8, queued: 0 (waited: 5s)
In progress: 8, queued: 0 (waited: 7s)
In progress: 8, queued: 0 (waited: 11s)
In progress: 8, queued: 0 (waited: 18s)
In progress: 8, queued: 0 (waited: 31s)
In progress: 0, queued: 0 (waited: 58s)
In progress: 0, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 170s)
In progress: 0, queued: 0 (waited: 231s)
In progress: 0, queued: 0 (waited: 292s)

Start Autopilot for one project with advanced options

You can decrease training time if there is no need to prepare the recommended model for deployment or to train blender models. This can be useful during the ML experimentation phase.

In [8]:

advanced_options = dr.AdvancedOptions(
    prepare_model_for_deployment=False, blend_best_models=False
)
In [9]:

project = dr.Project.create_from_dataset(
    training_dataset.id, project_name="Sequential Project (advanced options)"
)
In [10]:

project.analyze_and_model(
    target=TARGET,
    worker_count=-1,
    mode=AUTOPILOT_MODE.QUICK,
    advanced_options=advanced_options,
)
Out [10]:

Project(Sequential Project (advanced options))
In [11]:

project.wait_for_autopilot(check_interval=60)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 8, queued: 0 (waited: 4s)
In progress: 8, queued: 0 (waited: 7s)
In progress: 8, queued: 0 (waited: 11s)
In progress: 8, queued: 0 (waited: 18s)
In progress: 8, queued: 0 (waited: 31s)
In progress: 2, queued: 0 (waited: 58s)
In progress: 2, queued: 0 (waited: 110s)
In progress: 0, queued: 0 (waited: 170s)

You can see that the training time decreased from 292s to 170s (42% gain).

Modeling five projects in parallel

Create a list with five DataRobot projects that will be trained in parallel.

In [12]:

project_list = []
for n in range(1, 6):
    project_name = f"Parallel Project - {n}"
    project = dr.Project.create_from_dataset(
        training_dataset.id, project_name=project_name
    )
    project_list.append(project)
print(project_list)
[Project(Parallel Project - 1), Project(Parallel Project - 2), Project(Parallel Project - 3), Project(Parallel Project - 4), Project(Parallel Project - 5)]

The following function kicks off an independent training process for a single project; each thread runs it for one of the five projects created above.

In [13]:
def thread_function(project, start_time):
    print(f"Start training of project '{project.project_name}'...\n")
    project.analyze_and_model(
        target=TARGET, worker_count=-1, mode=AUTOPILOT_MODE.QUICK, max_wait=14400
    )
    project.wait_for_autopilot(check_interval=60)

    return datetime.datetime.now() - start_time

Submit tasks to executor

A ThreadPoolExecutor with the predefined number of threads is used to submit tasks for asynchronous execution. Using it as a context manager ensures that resources are released correctly.

In [14]:

with f.ThreadPoolExecutor(max_workers=THREAD_POOL_WORKERS) as executor:
    allFutures = {
        executor.submit(thread_function, pr, datetime.datetime.now()): pr
        for pr in project_list
    }

    for future in f.as_completed(allFutures):
        pr = allFutures[future]
        try:
            elapsed_time = future.result()
        except Exception as exc:
            print(
                f"Training of project '{pr.project_name}' generated an exception: {exc}"
            )
        else:
            print(f"Training of project '{pr.project_name}' finished in {elapsed_time}")
Start training of project 'Parallel Project - 1'...

Start training of project 'Parallel Project - 2'...

Start training of project 'Parallel Project - 3'...

Start training of project 'Parallel Project - 4'...

Start training of project 'Parallel Project - 5'...

In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 0s)
In progress: 4, queued: 4 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 4, queued: 4 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 4, queued: 4 (waited: 2s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 4, queued: 4 (waited: 3s)
In progress: 8, queued: 0 (waited: 4s)
In progress: 8, queued: 0 (waited: 5s)
In progress: 4, queued: 4 (waited: 4s)
In progress: 0, queued: 8 (waited: 0s)
In progress: 0, queued: 8 (waited: 0s)
In progress: 0, queued: 8 (waited: 1s)
In progress: 0, queued: 8 (waited: 1s)
In progress: 8, queued: 0 (waited: 7s)
In progress: 8, queued: 0 (waited: 7s)
In progress: 4, queued: 4 (waited: 7s)
In progress: 0, queued: 8 (waited: 2s)
In progress: 0, queued: 8 (waited: 2s)
In progress: 0, queued: 8 (waited: 3s)
In progress: 0, queued: 8 (waited: 3s)
In progress: 0, queued: 8 (waited: 4s)
In progress: 0, queued: 8 (waited: 4s)
In progress: 8, queued: 0 (waited: 11s)
In progress: 8, queued: 0 (waited: 11s)
In progress: 4, queued: 4 (waited: 10s)
In progress: 0, queued: 8 (waited: 7s)
In progress: 0, queued: 8 (waited: 7s)
In progress: 0, queued: 8 (waited: 11s)
In progress: 0, queued: 8 (waited: 11s)
In progress: 8, queued: 0 (waited: 18s)
In progress: 8, queued: 0 (waited: 18s)
In progress: 4, queued: 4 (waited: 18s)
In progress: 0, queued: 8 (waited: 18s)
In progress: 0, queued: 8 (waited: 18s)
In progress: 8, queued: 0 (waited: 31s)
In progress: 8, queued: 0 (waited: 31s)
In progress: 4, queued: 4 (waited: 31s)
In progress: 0, queued: 8 (waited: 31s)
In progress: 0, queued: 8 (waited: 31s)
In progress: 1, queued: 0 (waited: 57s)
In progress: 1, queued: 0 (waited: 58s)
In progress: 5, queued: 0 (waited: 57s)
In progress: 8, queued: 0 (waited: 57s)
In progress: 5, queued: 3 (waited: 57s)
In progress: 6, queued: 10 (waited: 109s)
In progress: 0, queued: 16 (waited: 110s)
In progress: 0, queued: 0 (waited: 109s)
In progress: 3, queued: 0 (waited: 109s)
In progress: 1, queued: 0 (waited: 109s)
In progress: 6, queued: 0 (waited: 170s)
In progress: 0, queued: 0 (waited: 170s)
In progress: 10, queued: 6 (waited: 170s)
In progress: 0, queued: 16 (waited: 170s)
In progress: 0, queued: 16 (waited: 170s)
In progress: 0, queued: 0 (waited: 231s)
In progress: 0, queued: 0 (waited: 231s)
In progress: 0, queued: 0 (waited: 231s)
In progress: 2, queued: 0 (waited: 231s)
In progress: 14, queued: 0 (waited: 231s)
In progress: 0, queued: 0 (waited: 292s)
In progress: 0, queued: 0 (waited: 292s)
In progress: 0, queued: 0 (waited: 291s)
In progress: 0, queued: 0 (waited: 291s)
In progress: 0, queued: 0 (waited: 292s)
In progress: 1, queued: 0 (waited: 352s)
In progress: 1, queued: 0 (waited: 352s)
In progress: 1, queued: 0 (waited: 352s)
In progress: 1, queued: 0 (waited: 352s)
In progress: 1, queued: 0 (waited: 352s)
In progress: 0, queued: 0 (waited: 413s)
In progress: 0, queued: 0 (waited: 413s)
In progress: 0, queued: 0 (waited: 413s)
In progress: 0, queued: 0 (waited: 413s)
In progress: 0, queued: 0 (waited: 413s)
In progress: 0, queued: 0 (waited: 474s)
In progress: 0, queued: 0 (waited: 474s)
In progress: 0, queued: 0 (waited: 474s)
Training of project 'Parallel Project - 2' finished in 0:08:38.625724
Training of project 'Parallel Project - 3' finished in 0:08:38.687918
Training of project 'Parallel Project - 4' finished in 0:08:38.785352
In progress: 0, queued: 0 (waited: 474s)
In progress: 0, queued: 0 (waited: 474s)
Training of project 'Parallel Project - 1' finished in 0:08:43.753684
Training of project 'Parallel Project - 5' finished in 0:08:43.793173

The training time for the multithreaded approach depends on multiple factors (CPU/RAM load, network bandwidth, etc.) and will vary between runs. In this run, the average training time was about 8 minutes 40 seconds.

Conclusion

Three experiments are performed in this AI Accelerator:

  • Training one project (training time: 292s)
  • Training one project with advanced options (training time: 170s)
  • Training five projects in parallel (training time: 520s)

Training five projects sequentially would take 1460s, while training five projects in parallel took 520s (64% gain i.e. 2.8 times faster). Combining parallel training with advanced project options can also decrease overall training time.
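The arithmetic behind those figures, as a quick sketch:

# Sequential baseline (five runs at ~292 s each) vs. the measured parallel wall-clock time
sequential_time = 5 * 292  # 1460 s
parallel_time = 520        # s

speedup = sequential_time / parallel_time          # ~2.8x faster
time_saved = 1 - parallel_time / sequential_time   # ~64% of training time saved
print(f"{speedup:.1f}x faster, {time_saved:.0%} gain")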

Given these numbers, building model factories with a multithreaded approach can be very helpful during the ML experimentation phase, especially when models need to be trained for use cases with thousands of SKUs. The main advantage of the presented approach is that it requires no third-party libraries; the entire process relies only on Python's built-in threading capabilities (via concurrent.futures).


The post Model Factory with Python Multithreading appeared first on DataRobot AI Platform.

]]>
Anti-Money Laundering (AML) Alert Scoring https://www.datarobot.com/ai-accelerators/anti-money-laundering-aml-alert-scoring/ Thu, 22 Feb 2024 12:40:53 +0000 https://www.datarobot.com/?post_type=aiaccelerator&p=53640 Our primary goal with this accelerator is to develop a powerful predictive model that utilizes historical customer and transactional data, enabling us to identify suspicious activities and generate crucial Suspicious Activity Reports (SARs).

The model will assign a suspicious activity score to future alerts, improving the effectiveness and efficiency of an AML compliance program by prioritizing alerts based on their ranking order according to the score.

The post Anti-Money Laundering (AML) Alert Scoring appeared first on DataRobot AI Platform.

]]>

The following outlines aspects of this use case.

  • Use case type: Anti-money laundering (false positive reduction)
  • Target audience: Data Scientist, Financial Crime Compliance Team
  • Desired outcomes:
    • Identify customer data and transaction activity indicative of a high risk for potential money laundering.
    • Detect anomalous changes in behavior or emerging money laundering patterns at an early stage.
    • Reduce the false positive rate for cases selected for manual review.
  • Metrics/KPIs:
    • Annual alert volume
    • Cost per alert
    • False positive reduction rate
  • Sample dataset

A crucial aspect of an effective AML compliance program involves monitoring transactions to detect suspicious activity. This encompasses various types of transactions, such as deposits, withdrawals, fund transfers, purchases, merchant credits, and payments. Typically, monitoring begins with a rules-based system that scans customer transactions for signs of potential money laundering. When a transaction matches a predefined rule, an alert is generated, and the case is referred to the bank’s internal investigation team for manual review. If the investigators determine that the behavior is indicative of money laundering, a SAR is filed with FinCEN.

However, the aforementioned standard transaction monitoring system has significant drawbacks. Most notably, the system’s rules-based and inflexible nature leads to a high rate of false positives, with as many as 90% of cases being incorrectly flagged as suspicious. This prevalence of false positives hampers investigators’ efficiency as they are required to manually filter out cases erroneously identified by the rules-based system.

Financial institutions’ compliance teams may have hundreds or even thousands of investigators, and the current systems hinder their effectiveness and efficiency in conducting investigations. The cost of reviewing an alert ranges from $30 to $70. For a bank that receives 100,000 alerts per year, this amounts to a substantial sum. By reducing false positives, potential savings of $600,000 to $4.2 million per year can be achieved.

Key takeaways:

  • Strategy/challenge: Help investigators focus their attention on the cases with the highest risk of money laundering, while minimizing time spent reviewing false-positive cases. For banks dealing with a high volume of daily transactions, improving the effectiveness and efficiency of investigations ultimately leads to fewer unnoticed instances of money laundering. This enables banks to strengthen their regulatory compliance and reduce the prevalence of financial crimes within their network.
  • Business driver: Enhance the efficiency of AML transaction monitoring and reduce operational costs. By harnessing their capability to dynamically learn patterns in complex data, machine learning models greatly enhance the accuracy of predicting which cases will result in a SAR filing. Machine learning models for anti-money laundering can be integrated into the review process to score and rank new cases.
  • Model solution: Assign a suspicious activity score to each AML alert, thereby improving the efficiency of an AML compliance program. Any case exceeding a predetermined risk threshold is forwarded to investigators for manual review. Cases falling below the threshold can be automatically discarded or subject to a less intensive review. Once machine learning models are deployed in production, they can be continuously retrained using new data to detect novel money laundering behaviors, incorporating insights from investigator feedback. In particular, the model will employ rules that trigger an alert whenever a customer requests a refund of any amount. Small refund requests can be utilized by money launderers to test the refund mechanism or establish a pattern of regular refund requests for their account.

Work with data

The linked synthetic dataset illustrates a credit card company’s AML compliance program. Specifically the model is detecting the following money-laundering scenarios:

  • Customer spends on the card but overpays their credit card bill and seeks a cash refund for the difference.
  • Customer receives credits from a merchant without offsetting transactions and either spends the money or requests a cash refund from the bank.

The unit of analysis in this dataset is an individual alert, meaning a rule-based engine is in place to produce an alert to detect potentially suspicious activity consistent with the above scenarios.

Problem framing

The target variable for this use case is whether or not the alert resulted in a SAR after manual review by investigators, making this a binary classification problem. The unit of analysis is an individual alert—the model will be built on the alert level—and each alert will receive a score ranging from 0 to 1. The score indicates the probability of the alert being a SAR.

The goal of applying a model to this use case is to lower the false positive rate, which means resources are not spent reviewing cases that are eventually determined to not be suspicious after an investigation.

In this use case, the False Positive Rate of the rules engine on the validation sample (1600 records) is:

Number of alerts with SAR=0 divided by the total number of records = 1436/1600 ≈ 90%.
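A minimal sketch of that calculation (the label counts below simply mirror the validation sample described above):

import pandas as pd

# Validation sample: 1,436 non-SAR alerts and 164 SARs (1,600 alerts in total)
df_valid = pd.DataFrame({"SAR": [0] * 1436 + [1] * 164})

# Every alert was flagged by the rules engine, so non-SAR alerts are false positives
false_positive_rate = (df_valid["SAR"] == 0).mean()
print(f"Rules-engine false positive rate: {false_positive_rate:.0%}")  # ~90%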

Data preparation

Consider the following when working with data:

  • Define the scope of analysis: Collect alerts from a specific analytical window to start with; it’s recommended that you use 12–18 months of alerts for model building.
  • Define the target: Depending on the investigation processes, the target definition could be flexible. In this walkthrough, alerts are classified as Level1, Level2, Level3, and Level3-confirmed. These labels indicate at which level of the investigation the alert was closed (i.e., confirmed as a SAR). To create a binary target, treat Level3-confirmed as SAR (denoted by 1) and the remaining levels as non-SAR alerts (denoted by 0).
  • Consolidate information from multiple data sources: Below is a sample entity-relationship diagram indicating the relationship between the data tables used for this use case. 

Some features are static information—kyc_risk_score and state of residence for example—these can be fetched directly from the reference tables.

For transaction behavior and payment history, the information will be derived from a specific time window prior to the alert generation date. This case uses 90 days as the time window to obtain the dynamic customer behavior, such as nbrPurchases90d, avgTxnSize90d, or totalSpend90d.
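For illustration, below is a minimal pandas sketch of deriving such 90-day aggregates from a raw transaction table; the table layouts and column names (customer_id, txn_date, txn_amount) are hypothetical stand-ins for your own data:

import pandas as pd

# Hypothetical alert and transaction tables
alerts = pd.DataFrame(
    {
        "alert_id": [1, 2],
        "customer_id": ["A", "B"],
        "alert_date": pd.to_datetime(["2023-06-01", "2023-06-15"]),
    }
)
txns = pd.DataFrame(
    {
        "customer_id": ["A", "A", "B"],
        "txn_date": pd.to_datetime(["2023-05-20", "2023-02-01", "2023-06-10"]),
        "txn_amount": [120.0, 75.5, 40.0],
    }
)

# Join each alert to its customer's transactions and keep only the 90-day lookback window
merged = alerts.merge(txns, on="customer_id", how="left")
window = merged[
    (merged.txn_date <= merged.alert_date)
    & (merged.txn_date > merged.alert_date - pd.Timedelta(days=90))
]

# Aggregate dynamic behavior features per alert
features = window.groupby("alert_id").agg(
    nbrPurchases90d=("txn_amount", "size"),
    avgTxnSize90d=("txn_amount", "mean"),
    totalSpend90d=("txn_amount", "sum"),
)
print(features)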

Features and sample data

The features in the sample dataset consist of KYC (Know-Your-Customer) information, demographic information, transactional behavior, and free-form text information from the customer service representatives’ notes. To apply this use case in your organization, your dataset should contain, minimally, the following features:

  • Alert ID
  • Binary classification target (SAR/no-SAR, 1/0, True/False, etc.)
  • Date/time of the alert
  • “Know Your Customer” score used at time of account opening
  • Account tenure, in months
  • Total merchant credit in the last 90 days
  • Number of refund requests by the customer in the last 90 days
  • Total refund amount in the last 90 days

Other helpful features to include are:

  • Annual income
  • Credit bureau score
  • Number of credit inquiries in the past year
  • Number of logins to the bank website in the last 90 days
  • Indicator that the customer owns a home
  • Maximum revolving line of credit
  • Number of purchases in the last 90 days
  • Total spend in the last 90 days
  • Number of payments in the last 90 days
  • Number of cash-like payments (e.g., money orders) in last 90 days
  • Total payment amount in last 90 days
  • Number of distinct merchants purchased at in the last 90 days
  • Customer Service Representative notes and codes based on conversations with customer (cumulative)

Below is an example of one row in the training data after it is merged and aggregated (it is broken into multiple lines for easier visualization).

Configure the Python client

The DataRobot API offers a programmatic alternative to the web interface for creating and managing DataRobot projects. It can be accessed through REST or DataRobot’s Python and R clients, supporting Windows, UNIX, and OS X environments. To authenticate with DataRobot’s API, you will need an endpoint and token, as detailed in the documentation. Once you have configured your API credentials, endpoints, and environment, you can leverage the DataRobot API to perform the following actions:

  1. Upload a dataset.
  2. Train a model to learn from the dataset using the Informative Features feature list.
  3. Test prediction outcomes on the model using new data.
  4. Deploy the model.
  5. Predict outcomes using the deployed model and new data.

Import libraries

In [1]:

# NOT required for Notebooks in DataRobot Workbench
# *************************************************
! pip install datarobot --quiet
# Upgrade DR to datarobot-3.2.0b0
# ! pip uninstall datarobot --yes
# ! pip install datarobot --pre

! pip install pandas --quiet
! pip install matplotlib --quiet

import getpass

import datarobot as dr

endpoint = "https://app.eu.datarobot.com/api/v2"
token = getpass.getpass()
dr.Client(endpoint=endpoint, token=token)
# *************************************************

········

Out[1]:

<datarobot.rest.RESTClientObject at 0x7fd37ba9fc40>
In[2]:

import datetime as datetime
import os

import datarobot as dr
import matplotlib.pyplot as plt
import pandas as pd

params = {"axes.titlesize": "8", "xtick.labelsize": "5", "ytick.labelsize": "6"}
plt.rcParams.update(params)

Analyze, clean, and curate data

Preparing data is an iterative process. Even if you have already cleaned and prepped your training data before uploading it, you can further enhance its quality by performing Exploratory Data Analysis (EDA).

In [3]:

# Load the training dataset
df = pd.read_csv(
    "https://s3.amazonaws.com/datarobot-use-case-datasets/DR_Demo_AML_Alert_train.csv",
    encoding="ISO-8859-1",
)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 31 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ALERT                             10000 non-null  int64  
 1   SAR                               10000 non-null  int64  
 2   kycRiskScore                      10000 non-null  int64  
 3   income                            9800 non-null   float64
 4   tenureMonths                      10000 non-null  int64  
 5   creditScore                       10000 non-null  int64  
 6   state                             10000 non-null  object 
 7   nbrPurchases90d                   10000 non-null  int64  
 8   avgTxnSize90d                     10000 non-null  float64
 9   totalSpend90d                     10000 non-null  float64
 10  csrNotes                          10000 non-null  object 
 11  nbrDistinctMerch90d               10000 non-null  int64  
 12  nbrMerchCredits90d                10000 non-null  int64  
 13  nbrMerchCreditsRndDollarAmt90d    10000 non-null  int64  
 14  totalMerchCred90d                 10000 non-null  float64
 15  nbrMerchCreditsWoOffsettingPurch  10000 non-null  int64  
 16  nbrPayments90d                    10000 non-null  int64  
 17  totalPaymentAmt90d                10000 non-null  float64
 18  overpaymentAmt90d                 10000 non-null  float64
 19  overpaymentInd90d                 10000 non-null  int64  
 20  nbrCustReqRefunds90d              10000 non-null  int64  
 21  indCustReqRefund90d               10000 non-null  int64  
 22  totalRefundsToCust90d             10000 non-null  float64
 23  nbrPaymentsCashLike90d            10000 non-null  int64  
 24  maxRevolveLine                    10000 non-null  int64  
 25  indOwnsHome                       10000 non-null  int64  
 26  nbrInquiries1y                    10000 non-null  int64  
 27  nbrCollections3y                  10000 non-null  int64  
 28  nbrWebLogins90d                   10000 non-null  int64  
 29  nbrPointRed90d                    10000 non-null  int64  
 30  PEP                               10000 non-null  int64  
dtypes: float64(7), int64(22), object(2)
memory usage: 2.4+ MB

The sample data contains the following features:

  1. ALERT: Alert Indicator
  2. SAR: Target variable, SAR Indicator
  3. kycRiskScore: Account relationship (Know Your Customer) score used at time of account opening
  4. income: Annual income
  5. tenureMonths: Account tenure in months
  6. creditScore: Credit bureau score
  7. state: Account billing address state
  8. nbrPurchases90d: Number of purchases in last 90 days
  9. avgTxnSize90d: Average transaction size in last 90 days
  10. totalSpend90d: Total spend in last 90 days
  11. csrNotes: Customer Service Representative notes and codes based on conversations with customer
  12. nbrDistinctMerch90d: Number of distinct merchants purchased at in last 90 days
  13. nbrMerchCredits90d: Number of credits from merchants in last 90 days
  14. nbrMerchCreditsRndDollarAmt90d: Number of credits from merchants in round dollar amounts in last 90 days
  15. totalMerchCred90d: Total merchant credit amount in last 90 days
  16. nbrMerchCreditsWoOffsettingPurch: Number of merchant credits without an offsetting purchase in last 90 days
  17. nbrPayments90d: Number of payments in last 90 days
  18. totalPaymentAmt90d: Total payment amount in last 90 days
  19. overpaymentAmt90d: Total amount overpaid in last 90 days
  20. overpaymentInd90d: Indicator that account was overpaid in last 90 days
  21. nbrCustReqRefunds90d: Number refund requests by the customer in last 90 days
  22. indCustReqRefund90d: Indicator that customer requested a refund in last 90 days
  23. totalRefundsToCust90d: Total refund amount in last 90 days
  24. nbrPaymentsCashLike90d: Number of cash-like payments (e.g., money orders) in last 90 days
  25. maxRevolveLine: Maximum revolving line of credit
  26. indOwnsHome: Indicator that the customer owns a home
  27. nbrInquiries1y: Number of credit inquiries in the past year
  28. nbrCollections3y: Number of collections in the past 3 years
  29. nbrWebLogins90d: Number of logins to the bank website in the last 90 days
  30. nbrPointRed90d: Number of loyalty point redemptions in the last 90 days
  31. PEP: Politically Exposed Person indicator
In [4]:

# Upload a dataset
ct = datetime.datetime.now()
file_name = f"AML_Alert_train_{int(ct.timestamp())}.csv"
dataset = dr.Dataset.create_from_in_memory_data(data_frame=df, fname=file_name)
dataset
Out [4]:

Dataset(name='AML_Alert_train_1687350171.csv', id='6492eb9c1e1e2e52c305e3ca')

While a dataset is being registered in Workbench, DataRobot also performs EDA1 analysis and profiling for every feature to detect feature types, automatically transform date-type features, and assess feature quality. Once registration is complete, you can view the exploratory data insights uncovered while computing EDA1, as detailed in the documentation.

Based on the exploratory data insights above, you can draw the following quick observations:

  1. The entire population of interest comprises only alerts, which aligns with the problem’s focus.
  2. The false positive alerts (SAR=0) account for approximately 90%, which is typical for AML problems.
  3. Some features, such as PEP, do not offer any useful information as they consist entirely of zeroes or have a single value.
  4. Certain features, like nbrPaymentsCashLike90d, exhibit signs of zero inflation.
  5. There is potential to convert certain numerical features, such as indOwnsHome, into categorical features.
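For the last observation, the Python client offers variable type transforms. The snippet below is a minimal sketch, assuming a project has already been created from this dataset (as shown in the next section); check create_type_transform_feature against your client version before relying on it:

import datarobot as dr

# Create a categorical copy of a numeric 0/1 indicator so blueprints treat it as a category
# (assumes `project` is a DataRobot project built from this dataset)
project.create_type_transform_feature(
    name="indOwnsHome_cat",     # name of the new derived feature
    parent_name="indOwnsHome",  # existing numeric feature
    variable_type=dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT,
)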

Additionally, DataRobot automatically detects and addresses common data quality issues with minimal or no user intervention. For instance, a binary column is automatically added within a blueprint to flag rows with excess zeros. This allows the model to capture potential patterns related to abnormal values. No further user action is required.

Create and manage experiments

Experiments are the individual “projects” within a Use Case. They allow you to vary data, targets, and modeling settings to find the optimal models to solve your business problem. Within each experiment, you have access to its Leaderboard and model insights, as well as experiment summary information.

In [5]:

# Create a new project based on a dataset
ct = datetime.datetime.now()
project_name = f"Anti Money Laundering Alert Scoring_{int(ct.timestamp())}"
project = dataset.create_project(project_name=project_name)
print(
    f"""Project Details
Project URL: {project.get_uri()}
Project ID: {project.id}
Project Name: {project.project_name}
    """
)
Project Details
Project URL: https://app.eu.datarobot.com/projects/6492ebd2b83ed3cc6ec5bb2e/models
Project ID: 6492ebd2b83ed3cc6ec5bb2e
Project Name: Anti Money Laundering Alert Scoring_1687350226

Start modeling

In [6]:

# Select modeling parameters and start the modeling process
project.analyze_and_model(target="SAR", mode=dr.AUTOPILOT_MODE.QUICK, worker_count="-1")

project.wait_for_autopilot(check_interval=20.0, timeout=86400, verbosity=0)

Evaluate experiments

As you proceed with modeling, Workbench generates a model Leaderboard, a ranked list of models that facilitates quick evaluation. The models on the Leaderboard are ranked based on the selected optimization metrics, such as LogLoss in this case.

Autopilot, DataRobot’s “survival of the fittest” modeling mode, automatically selects the most suitable predictive models for the specified target feature and trains them with increasing sample sizes. Autopilot not only identifies the best-performing models but also recommends a model that excels at predicting the target feature SAR. The model selection process considers a balance of accuracy, metric performance, and model simplicity. For a detailed understanding, please refer to the model recommendation process description.

Within the Leaderboard, you can click on a specific model to access visualizations for further exploration, as outlined in the documentation.


Lift Chart

The Lift Chart above shows how effectively the model separates SAR and non-SAR alerts. After an alert in the out-of-sample partition is scored by the trained model, it is assigned a risk score that measures the likelihood of the alert becoming a SAR.

In the Lift Chart, alerts are sorted based on the SAR risk, broken down into 10 deciles, and displayed from lowest to the highest. For each decile, DataRobot computes the average predicted SAR risk (blue plus) as well as the average actual SAR event (orange circle) and depicts the two lines together. For the recommended model built for this false positive reduction use case, the SAR rate of the top decile is about 65%, which is a significant lift from the ~10% SAR rate in the training data. The top three deciles capture almost all SARs, which means that the 70% of alerts with very low predicted SAR risk rarely result in a SAR.

ROC Curve

Once you have confidence that the model is performing well, select an explicit threshold to make a binary decision based on the continuous SAR risk predicted by DataRobot. To pick up the optimal threshold, there are three important criteria:

  1. The false negative rate has to be as small as possible. False negatives are the alerts that DataRobot determines are not SARs which then turn out to be true SARs. Missing a true SAR is very dangerous and would potentially result in an MRA (matter requiring attention) or regulatory fine. This example takes a conservative approach to have a 0 false negative rate, meaning all true SARs are captured. To achieve this, the threshold has to be low enough to capture all the SARs.
  2. Keep the alert volume as low as possible to reduce enough false positives. In this context, all alerts generated in the past that are not SARs are the de-facto false positives; the machine learning model is likely to assign a lower score to those non-SAR alerts. Therefore, pick a high enough threshold to reduce as many false positive alerts as possible.
  3. Ensure the selected threshold is not only working on the seen data, but also on the unseen data. This is required so that when the model is deployed to the transaction monitoring system for on-going scoring, it can still reduce false positives without missing any SARs.

From experimenting with different choices of thresholds using the cross-validation data (the data used for model training and validation), it seems that 0.03 is the optimal threshold since it satisfies the first two criteria. On one hand, the false negative rate is 0; on the other hand, the alert volume is reduced from 8000 to 2098 (False Positive + True Positive), meaning the number of investigations is reduced by 73% (5902/8000) without missing any SARs.

For the third criterion—setting the threshold to work on unseen alerts—you can quickly validate it in DataRobot. By changing the Data Selection dropdown to Holdout, and applying the same threshold (0.03), the false negative rate remains 0 and the reduction in investigations is still 73% (1464/2000). This proves that the model generalizes well and will perform as expected on unseen data.

ROC curve
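A minimal sketch of this zero-false-negative threshold search via the Python client is shown below (it assumes the project created above; the validation partition is used here for illustration, and the resulting threshold should be reviewed against your own data):

import datarobot as dr

# Scan the validation ROC points for the highest threshold that still misses no SARs
model = dr.ModelRecommendation.get(project.id).get_model()
roc = model.get_roc_curve(source=dr.enums.CHART_DATA_SOURCE.VALIDATION)

zero_fn_points = [pt for pt in roc.roc_points if pt["false_negative_score"] == 0]
best_point = max(zero_fn_points, key=lambda pt: pt["threshold"])

# Alerts that would still be reviewed at this threshold
alerts_to_review = best_point["true_positive_score"] + best_point["false_positive_score"]
print(best_point["threshold"], alerts_to_review)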

Model insights

DataRobot offers a comprehensive suite of powerful tools and features designed to facilitate the interpretation, explanation, and validation of the factors influencing a model’s predictions. One such tool is Feature Impact, which provides a high-level visualization that identifies the features that have the strongest influence on the model’s decisions. A large impact indicates that removing this feature would significantly deteriorate the model’s performance. On the other hand, features with lower impact may have relatively less importance individually but can still contribute to the overall predictive power of the model.
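A minimal sketch of pulling Feature Impact through the Python client is shown below (it assumes the project object from earlier; the returned field names reflect the client's Feature Impact payload and may vary by version):

import pandas as pd
import datarobot as dr

# Compute (or fetch previously computed) Feature Impact for the recommended model
model = dr.ModelRecommendation.get(project.id).get_model()
feature_impact = model.get_or_request_feature_impact()

# Show the ten most impactful features, normalized so the strongest feature scores 1.0
fi = pd.DataFrame(feature_impact).sort_values("impactNormalized", ascending=False)
print(fi[["featureName", "impactNormalized"]].head(10))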

Predict and deploy

Once you identify the model that best learns patterns in your data to predict SARs, DataRobot makes it easy to deploy the model into your alert investigation process. This is a critical step for implementing the use case, as it ensures that predictions are used in the real world to reduce false positives and improve efficiency in the investigation process. The following sections describe activities related to preparing and then deploying a model.

The following applications of the alert-prioritization score from the false positive reduction model both automate and augment the existing rule-based transaction monitoring system.

  • If the FCC (Financial Crime Compliance) team is comfortable with removing the low-risk alerts (very low prioritization score) from the scope of investigation, then the binary threshold selected during the model building stage will be used as the cutoff to remove those no-risk alerts. The investigation team will only investigate alerts above the cutoff, which will still capture all the SARs based on what was learned from the historical data.
  • Often regulatory agencies will consider auto-closure or auto-removal as an aggressive treatment to production alerts. If auto-closing is not the ideal way to use the model output, the alert prioritization score can still be used to triage alerts into different investigation processes, hence improving the operational efficiency.

See the deep dive at the end of this use case for information on decision process considerations.
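As a sketch of the second application above (triaging rather than auto-closing), alerts can be bucketed into investigation tiers by their score; the cutoffs below are hypothetical and would be set by the compliance team:

# Hypothetical triage rules applied to the model's alert-prioritization scores
def triage_alert(score: float) -> str:
    if score >= 0.50:
        return "priority review"     # highest-risk alerts, investigate first
    elif score >= 0.03:
        return "standard review"     # above the SAR-capture threshold
    else:
        return "light-touch review"  # below the threshold; simplified process

scores = [0.71, 0.12, 0.01, 0.04]
print([triage_alert(s) for s in scores])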

You can use the following code to return the Recommended for Deployment model to use for model predictions.

In [7]:

model = dr.ModelRecommendation.get(project.id).get_model()
model
Out [7]:

Model('RandomForest Classifier (Gini)')

Compute predictions before deployment

By uploading an external dataset, you can verify consistent performance prior to deployment. This new data needs the same transformations applied to it as were applied to the training data.

You can use the UI and follow the five steps of the workflow for testing predictions. When predictions are complete, you can save prediction results to a CSV file.

With the following code, you can obtain more detailed results, including the predictions, the probability of class_1 (positive_probability), the probability of class_0 (autogenerated), the actual values of the target (SAR), and all features. Furthermore, you can compute Prediction Explanations on this external dataset (which was not part of the training data).

In [10]:

# Load an alert dataset for predictions
df_score = pd.read_csv(
    "https://s3.amazonaws.com/datarobot-use-case-datasets/DR_Demo_AML_Alert_pred.csv",
    encoding="ISO-8859-1",
)

# Get the recommended model
model_rec = dr.ModelRecommendation.get(project.id).get_model()
model_rec.set_prediction_threshold(0.03)

# Upload a scoring data set to DataRobot
prediction_dataset = project.upload_dataset(df_score.drop("SAR", axis=1))
predict_job = model_rec.request_predictions(prediction_dataset.id)

# Make predictions
predictions = predict_job.get_result_when_complete()

# Display prediction results
results = pd.concat(
    [predictions.drop("row_id", axis=1), df_score.drop("ALERT", axis=1)], axis=1
)
results.head()
                       0         1         2         3         4
prediction             0         0         0         1         1
positive_probability   0         0         0         0.120918  0.407422
prediction_threshold   0.03      0.03      0.03      0.03      0.03
class_0.0              1         1         1         0.879082  0.592578
class_1.0              0         0         0         0.120918  0.407422
SAR                    0         0         0         1         1

The remaining columns of the output contain the original alert features for each row: kycRiskScore, income, tenureMonths, creditScore, indCustReqRefund90d, totalRefundsToCust90d, nbrPaymentsCashLike90d, maxRevolveLine, indOwnsHome, nbrInquiries1y, nbrCollections3y, nbrWebLogins90d, nbrPointRed90d, and PEP.

Look at the results above. Since this is a binary classification problem:

  • As the positive_probability approaches zero, the row is a stronger candidate for class_0 with prediction value of 0 (the alert is not SAR).
  • As positive_probability approaches one, the outcome is more likely to be of class_1 with prediction value of 1 (the alert is SAR).
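If you also want the CSV export mentioned earlier for the UI workflow, the scored results can be written to disk directly from the notebook; the file name below is just an example.

# Persist the scored alerts for downstream review (example file name).
results.to_csv("aml_alert_predictions.csv", index=False)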

From the KDE (Kernel Density Estimate) plot below, you can see that this sample of the data is weighted more strongly toward class_0 (the alert is not SAR); the probability density of the predictions closely tracks that of the actuals.

In [11]:

plt_kde = results[["positive_probability", "SAR"]].plot.kde(
    xlim=(0, 1), title="Prediction Distribution"
)
prediction distribution
In [12]:

# Prepare Prediction Explanations
pe_init = dr.PredictionExplanationsInitialization.create(project.id, model_rec.id)
pe_init.wait_for_completion()

Computing Prediction Explanations is a resource-intensive task. You can set a maximum number of explanations per row and also configure prediction value thresholds to speed up the process.

Considering the prediction distribution above, set the threshold_low to 0.2 and threshold_high to 0.5. This will provide Prediction Explanations only for those extreme predictions where positive_probability is lower than 0.2 or higher than 0.5.

In [13]:

# Compute Prediction Explanations with a custom config
number_of_explanations = 3
pe_comput = dr.PredictionExplanations.create(
    project.id,
    model_rec.id,
    prediction_dataset.id,
    max_explanations=number_of_explanations,
    threshold_low=0.2,
    threshold_high=0.5,
)
pe_result = pe_comput.get_result_when_complete()
explanations = pe_result.get_all_as_dataframe().drop("row_id", axis=1).dropna()
display(explanations.head())
                        0                       1                  2                       3                     5
prediction              0                       0                  0                       1                     0
class_0_probability     1                       1                  1                       0.879082              0.98379
class_1_probability     0                       0                  0                       0.120918              0.01621
explanation_0_feature   totalSpend90d           avgTxnSize90d      totalSpend90d           nbrCustReqRefunds90d  avgTxnSize90d
explanation_0_strength  -3.210206               -3.834376          -3.20076                -1.514812             -0.402981
explanation_1_feature   nbrPaymentsCashLike90d  totalSpend90d      nbrPaymentsCashLike90d  csrNotes              csrNotes
explanation_1_strength  -2.971257               -3.261914          -3.031098               -0.708436             0.390769
explanation_2_feature   csrNotes                totalMerchCred90d  totalMerchCred90d       avgTxnSize90d         nbrPurchases90d
explanation_2_strength  -2.819563               -2.982999          -2.990864               -0.141831             -0.329526

Note that row 4, whose prediction falls between threshold_low and threshold_high, has no explanations and is removed by dropna(). The full output also includes the class labels and, for each explanation, the feature value and qualitative strength.

The following code lets you see how often various features show up among the top explanations impacting the probability of SAR.

In [14]:

from functools import reduce

# Create a combined histogram of all the explanations
explanations_hist = reduce(
    lambda x, y: x.add(y, fill_value=0),
    (
        explanations["explanation_{}_feature".format(i)].value_counts()
        for i in range(number_of_explanations)
    ),
)

plt_expl = explanations_hist.plot.bar()
Bar chart: how often each feature appears among the top Prediction Explanations

Having seen the model’s Feature Impact insight earlier, the high occurrence of totalSpend90d, overPaymentAmt90d, and totalMerchCred90d as Prediction Explanations is not entirely surprising. These were some of the top-ranked features in the impact chart.

Deploy a model and monitor performance

The DataRobot platform offers a wide variety of deployment methods, of which the most direct route is deploying a model from the Leaderboard. When you create a deployment from the Leaderboard, DataRobot automatically creates a model package for the deployed model. You can access the model package at any time in the Model Registry. For more details, see the documentation for deploying from the Leaderboard. Alternatively, you can create a deployment programmatically, as shown in the code below.

DataRobot continuously monitors the model deployed on the dedicated prediction server. With DataRobot MLOps, the modeling team can monitor and manage the alert prioritization model by tracking drift in the distribution of the input features as well as performance degradation over time.

In [15]:

pred_serv_id = dr.PredictionServer.list()[0].id
deployment = dr.Deployment.create_from_learning_model(
    model_id=model_rec.id,
    label="Anti Money Laundering Alert Scoring",
    description="Anti Money Laundering Alert Scoring",
    default_prediction_server_id=pred_serv_id,
)
deployment
Out [15]:

Deployment(Anti Money Laundering Alert Scoring)

When you select a deployment from the Deployments inventory, DataRobot opens the Overview page for that deployment, which provides a model- and environment-specific summary describing the deployment, including the information you supplied when creating it and any model replacement activity.

The Service Health tab tracks metrics about a deployment’s ability to respond to prediction requests quickly and reliably. This helps identify bottlenecks and assess capacity, which is critical to proper provisioning.

The Data Drift tab provides interactive and exportable visualizations that help identify the health of a deployed model over a specified time interval.
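Both tabs have programmatic counterparts. The snippet below is a minimal sketch and assumes your client version exposes Deployment.get_service_stats and Deployment.get_feature_drift; the exact metrics and attribute names can vary by version.

# Minimal sketch: pull service health and feature drift for the deployment.
service_stats = deployment.get_service_stats()
print(service_stats.metrics)  # e.g., total predictions, response time statistics

# Per-feature drift over the default monitoring window (assumed API; check your client version).
for feature in deployment.get_feature_drift():
    print(feature.name, feature.drift_score)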

Implementation risks

When operationalizing this use case, consider the following, which may impact outcomes and require model re-evaluation:

  • Changes in the transactional behavior of money launderers.
  • Novel information introduced into the transaction and customer records that was not seen by the machine learning models.

Deep dive: Imbalanced targets

In AML and transaction monitoring, the SAR rate is usually very low (1%–5%, depending on the detection scenarios) and can even fall below 1% in extremely unproductive scenarios. In machine learning, such a problem is called class imbalance. The question becomes: how can you mitigate the risk of class imbalance and let the model learn as much as possible from the limited number of known suspicious activities?

DataRobot offers several techniques for handling class imbalance problems, including:

  • Evaluate the model with different metrics. For binary classification (for example, the false positive reduction model here), LogLoss is used as the default metric to rank models on the Leaderboard. Since the rule-based system is often unproductive, which leads to a very low SAR rate, it's reasonable to also look at a different metric, such as the SAR rate in the top 5% of alerts in the prioritization list. The objective of the model is to assign higher prioritization scores to high-risk alerts, so a higher rate of SARs in the top tier of the prioritization score is desirable. In the example shown in the image below, the SAR rate in the top 5% of the prioritization score is more than 70% (the original SAR rate is less than 10%), which indicates that the model is very effective at ranking alerts by SAR risk.
  • DataRobot also provides flexibility for modelers when tuning hyperparameters, which can also help with the class imbalance problem. In the example below, the Random Forest Classifier is tuned by enabling the balance_bootstrap option (randomly sampling an equal number of SAR and non-SAR alerts for each decision tree in the forest); you can see that the validation score of the new ‘Balanced Random Forest Classifier’ model is slightly better than that of the parent model.
  • You can also use Smart Downsampling (from the Advanced Options tab) to intentionally downsample the majority class (i.e., non-SAR alerts) in order to build faster models with similar accuracy; a sketch of enabling this programmatically follows this list.
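The following is a minimal sketch of enabling Smart Downsampling programmatically. It is illustrative only: it assumes the standard AdvancedOptions parameters (smart_downsampled, majority_downsampling_rate) and would replace the project-setup step used earlier in this use case; older client versions use project.set_target with the same arguments.

# Minimal sketch: start modeling with Smart Downsampling enabled.
advanced_options = dr.AdvancedOptions(
    smart_downsampled=True,
    majority_downsampling_rate=50.0,  # keep ~50% of non-SAR alerts; tune to your data volume
)
project.analyze_and_model(
    target="SAR",
    advanced_options=advanced_options,
    worker_count=-1,
)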

Deep dive: Decision process

A review process typically consists of a deep-dive analysis by investigators. The data related to the case is made available for review so that the investigators can develop a 360-degree view of the customer, including their profile, demographics, and transaction history. Additional data from third-party data providers and web crawling can supplement this information to complete the picture.

For transactions that do not get auto-closed or auto-removed, the model can help the compliance team create a more effective and efficient review process by triaging their reviews. The predictions and their explanations also give investigators a more holistic view when assessing cases.

Risk-based alert triage

Based on the prioritization score, the investigation team could take different investigation strategies. For example:

  • No-risk or low-risk alerts can be reviewed on a quarterly basis instead of monthly. Frequently alerted entities without any SAR risk can then be reviewed once every three months, which significantly reduces investigation time.
  • High-risk alerts with higher prioritization scores can have their investigation fast-tracked to the final stage in the alert escalation path. This will significantly reduce the effort spent on level 1 and level 2 investigation.
  • Medium-risk alerts can follow the standard investigation process.

Smart alert assignment

For an alert investigation team that is geographically dispersed, the alert prioritization score can be used to assign alerts to different teams more effectively. High-risk alerts can be assigned to the team with the most experienced investigators, while low-risk alerts can be handled by a less experienced team. This mitigates the risk of missing suspicious activities due to a lack of experience with alert investigations.

For both approaches, the definition of high/medium/low risk could be either a set of hard thresholds (for example, High: score >= 0.5, Medium: 0.5 > score >= 0.3, Low: score < 0.3) or based on the percentiles of the alert scores on a monthly basis (for example, High: above the 80th percentile, Medium: between the 50th and 80th percentiles, Low: below the 50th percentile).
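Either policy is straightforward to encode once alerts have been scored. Below is a minimal sketch with pandas and NumPy that applies both options to the positive_probability column produced earlier; the tier boundaries are just the example values above.

import numpy as np

# Minimal sketch: triage scored alerts into risk tiers.
scores = results["positive_probability"]

# Option 1: hard thresholds
results["risk_tier"] = np.where(
    scores >= 0.5, "High", np.where(scores >= 0.3, "Medium", "Low")
)

# Option 2: monthly percentiles (assumes `results` covers one month of alerts)
p50, p80 = scores.quantile([0.5, 0.8])
results["risk_tier_pct"] = np.where(
    scores > p80, "High", np.where(scores > p50, "Medium", "Low")
)

print(results["risk_tier"].value_counts())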
