Reinforcement Learning in DataRobot

In this notebook, we implement a very simple model based on the Q-learning algorithm. It demonstrates a basic form of reinforcement learning that doesn't require a deep understanding of neural networks or advanced mathematics, and shows how one might deploy such a model in DataRobot.

This example shows the Grid World problem, where an agent learns to navigate a grid to reach a goal.

The notebook will go through the following steps:

  1. Define State and Action Space
  2. Create a Q-table to store expected rewards for each state/action combination
  3. Implement learning algorithm and train model
  4. Evaluate model
  5. Deploy to a DataRobot Rest API end-point

1. Define State and Action Space

Let’s first install datarobotx for some convenient DataRobot deployment procedures.

In [ ]:

%%bash
pip install -U datarobotx
In [ ]:

import random

import numpy as np
In [ ]:

# Grid settings
grid_size = 4

# Function to build a list of all state tuples


def build_state_list(grid_size):
    state_list = []
    for i in range(grid_size):
        for j in range(grid_size):
            state_list.append((i, j))
    return state_list


all_states = build_state_list(grid_size)

# Here the goal is the corner state (3, 3) (it could be the center or any other state)
goal_state = (3, 3)
n_states = grid_size * grid_size
n_actions = 4  # Up, Down, Left, Right

2. Create a Q-table to store expected rewards for each state/action combination

In [ ]:

# Initialize Q-table
Q = np.zeros((n_states, n_actions))

# Helper functions


def state_to_index(state):
    return state[0] * grid_size + state[1]


def index_to_state(index):
    return (index // grid_size, index % grid_size)


def get_possible_actions(state):
    actions = []
    if state[0] > 0:
        actions.append(0)  # Up
    if state[0] < grid_size - 1:
        actions.append(1)  # Down
    if state[1] > 0:
        actions.append(2)  # Left
    if state[1] < grid_size - 1:
        actions.append(3)  # Right
    return actions


# Correct the state transition function to prevent invalid states


def take_action(state, action):
    new_state = list(state)
    if action == 0 and state[0] > 0:
        new_state[0] -= 1  # Up
    if action == 1 and state[0] < grid_size - 1:
        new_state[0] += 1  # Down
    if action == 2 and state[1] > 0:
        new_state[1] -= 1  # Left
    if action == 3 and state[1] < grid_size - 1:
        new_state[1] += 1  # Right
    return tuple(new_state)
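
As a quick sanity check, the helpers above can be exercised directly; the expected values below follow from the definitions of state_to_index, index_to_state, and take_action.

# Sanity checks for the helper functions defined above
assert state_to_index((2, 1)) == 9  # 2 * grid_size + 1
assert index_to_state(9) == (2, 1)  # round-trips back to the state tuple
assert take_action((0, 0), 0) == (0, 0)  # "Up" at the edge leaves the state unchanged
assert take_action((0, 0), 3) == (0, 1)  # "Right" moves within the grid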

3. Implement learning algorithm and train model
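
The training cell below applies the standard tabular Q-learning update. After taking action a in state s and observing reward r and next state s', the table entry is nudged toward the bootstrapped target:

Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a))

where α is the learning rate and γ is the discount factor set below.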

In [ ]:

# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1  # Exploration rate
n_episodes = 100000

# Training the model with corrected state transitions
for episode in range(n_episodes):
    # start at a random state
    state = random.choice(all_states)
    done = state == goal_state

    while not done:
        state_index = state_to_index(state)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action
            action = random.choice(get_possible_actions(state))
        else:
            # Exploit: choose the best action from Q-table
            action = np.argmax(Q[state_index])

        # Take action and observe reward
        next_state = take_action(state, action)
        reward = 1 if next_state == goal_state else 0
        next_state_index = state_to_index(next_state)

        # Q-learning update
        Q[state_index, action] = Q[state_index, action] + learning_rate * (
            reward
            + discount_factor * np.max(Q[next_state_index])
            - Q[state_index, action]
        )

        # Transition to the next state
        state = next_state
        done = state == goal_state

4. Evaluate model

First, we will show a single path, then measure on average how many actions it takes to reach the goal state.

In [ ]:

# Evaluating the model
state = random.choice(all_states)
print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])  # Choose the best action
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (3, 3)
[(3, 3)]
In [ ]:

total_actions = 0  # Total number of actions taken to reach the goal
for state in all_states:
    # Evaluating the model
    trajectory = [state]
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        action = np.argmax(Q[state_index])  # Choose the best action
        state = take_action(state, action)
        trajectory.append(state)
        done = state == goal_state
        total_actions += 1
print(
    "Average number of actions taken to reach the goal:",
    total_actions / len(all_states),
)
Out [ ]:

Average number of actions taken to reach the goal: 3.0
Is this optimal? The optimal policy moves directly toward the goal, so from any state the minimum number of actions is its Manhattan distance to (3, 3). From the opposite corner this is 6 actions, for the next 2 states it is 5 actions, for the next 3 it is 4 actions, then 4 states need 3 actions, 3 states need 2, 2 states need 1, and the goal state itself needs 0. By simple arithmetic we have

1*6 + 2*5 + 3*4 + 4*3 + 3*2 + 2*1 + 1*0 = 48

Total states = 16

Therefore, the optimal average is 48/16 = 3, which is exactly our average number of actions.
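
This can also be cross-checked in code: the optimal average is simply the mean Manhattan distance to the goal over all states (a small sketch reusing all_states and goal_state defined earlier).

# Mean Manhattan distance to the goal over all states; should print 3.0
optimal_average = sum(
    abs(goal_state[0] - s[0]) + abs(goal_state[1] - s[1]) for s in all_states
) / len(all_states)
print(optimal_average)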

5. Deploy to DataRobot Rest API end-point

In [ ]:

import pickle

import datarobot as dr
import numpy as np
import pandas as pd
In [ ]:

import os

os.makedirs("./storage/deploy/", exist_ok=True)
# save the Q table to a pickle file
with open("./storage/deploy/q_table.pkl", "wb") as f:
    pickle.dump(Q, f)

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

In [ ]:

dr_client = dr.Client()

Define Hooks for Deploying an Unstructured Custom Model. One could use a standard custom deployment, but an unstructured model is used here to illustrate the flexibility available for more complex RL problems.

In [ ]:

def load_model(input_dir):
    """Custom model hook for loading our Q-table

    Make sure to execute the cell earlier in the notebook that creates the Q-table before deploying
    """

    with open(input_dir + "/storage/deploy/" + "q_table.pkl", "rb") as f:
        Q = pickle.load(f)

    return Q


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for return action.

    model: The output of load_model is passed to this object
    data: str
        Expects json string passed in request body.
        Required keys:
                state: tuple(int, int) .. Current state of the agent
    query: None
        Unused
    **kwargs: dict
        Unused

    Returns:
        JSON string with output action

    """
    import json

    import numpy as np

    Q = model
    grid_size = int(np.sqrt(len(Q)))  # Grid size is inferred from the Q-table

    # Helper functions
    def state_to_index(state):
        return state[0] * grid_size + state[1]
    
    data_dict = json.loads(data)
    state = data_dict["state"]

    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])

    return json.dumps({"action": action}, default=int)

Test out the prediction structure prior to deployment.

In [ ]:

import json

score_unstructured(
    load_model("."),
    json.dumps({"state": (0, 1)}),
    None,
)
Out [ ]:

'{"action": 1}'

Deploy the RL policy model. We will use the deploy() convenience method in drx, which:

  • Builds a new Custom Model Environment
  • You can also use a DataRobot Python Drop-in Environment (e.g. “6386dc1159c606b0d8beddc7”)
  • Assembles a new Custom Model with the provided hooks
  • Deploys an Unstructured Custom Model to your Deployments
  • Returns an object which can be used to make predictions

Use environment_id to re-use an existing Custom Model Environment that you’re happy with for shorter iteration cycles on the custom model hooks.

Note: See https://app.datarobot.com/docs/api/api-quickstart/index.html for instructions on setting up a drconfig.yaml, or call drx.Context() to initialize your credentials.

In [ ]:

import datarobotx as drx

drx.Context().endpoint = dr_client.endpoint
drx.Context().token = dr_client.token
In [ ]:

deployment = drx.deploy(
    "storage/deploy/",
    hooks={"score_unstructured": score_unstructured, "load_model": load_model},
    extra_requirements=[],
    # environment_id="6386dc1159c606b0d8beddc7",
)
Out [ ]:

# Deploying custom model
  - Unable to auto-detect model type; any provided paths and files will be
    exported - dependencies should be explicitly specified using
    `extra_requirements` or `environment_id`
  - Preparing model and environment...
  - Configured environment [[Custom]
    priceless-ganguly](https://app.datarobot.com/model-registry/custom-environments/65ac4115be769b7f85d5aaf9)
    with requirements:
      python 3.9.16
      datarobot-drum==1.10.14
      datarobot-mlops==9.2.8
      cloudpickle==2.2.1
  - Awaiting custom environment build...
Out [ ]:

  - Configuring and uploading custom model...

    100%|███████████████████████████| 11.0k/11.0k [00:00<00:00, 5.14MB/s]
  - Registered custom model
    [priceless-ganguly](https://app.datarobot.com/model-registry/custom-models/65ac42ce046ed058aada50c7/info)
    with target type: Unstructured
  - Creating and deploying model package...
Out [ ]:

  - Created deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Custom model deployment complete

Let’s try out our deployment and track the trajectory from the deployed policy (which returns an action).

In [ ]:

# If your deployment already occurred or your notebook restarted due to inactivity, get the ID from the URL in the UI
# deployment = drx.Deployment("YOUR DEPLOYMENT ID HERE")
deployment.predict_unstructured({"state": (0, 1)})
Out [ ]:

# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
{'action': 1}
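
Because this is an unstructured deployment, the same prediction can also be requested over plain REST. The snippet below is only a sketch with placeholder values: the prediction server URL, deployment ID, API token, and DataRobot-Key are assumptions to replace with your own, and the exact endpoint URL should be confirmed in the deployment's Predictions tab in the UI.

import json

import requests

# Placeholder values (assumptions); copy the real ones from the deployment's Predictions tab
PREDICTION_SERVER = "https://your-prediction-server.datarobot.com"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"
API_TOKEN = "YOUR_API_TOKEN"
DATAROBOT_KEY = "YOUR_DATAROBOT_KEY"  # only required on some installations

url = f"{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictionsUnstructured"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_TOKEN}",
    "DataRobot-Key": DATAROBOT_KEY,
}
response = requests.post(url, headers=headers, data=json.dumps({"state": [0, 1]}))
print(response.text)  # e.g. '{"action": 1}'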

Test and print trajectory.

In [ ]:

state = (0, 1)
goal_state = (3, 3)

print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    action = deployment.predict_unstructured({"state": state})["action"]
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (0, 1)
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
[(0, 1), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]