Off-Policy vs On-Policy vs Offline Reinforcement Learning Demystified! (generated by author)

In this article, we will try to understand how on-policy, off-policy and offline reinforcement learning algorithms fundamentally differ. Reinforcement learning theory contains a fair amount of intimidating jargon, but that jargon rests on simple ideas.

Let’s Begin with Understanding RL

Reinforcement learning is a subfield of machine learning in which an agent learns to choose actions from its action space while interacting with an environment, so as to maximize the reward it collects over time. Complex enough? Let’s break this definition down for better understanding.

Agent: The program you train, with the aim of doing a job you specify.
Environment: The world in which the agent performs actions.
Action: A move made by the agent, which causes a change in the environment.
Rewards: The feedback the environment returns for an action, used to evaluate it.
States: What the agent observes about the environment.

Plot 1 *[1]

Agents learn in an interactive environment by trial and error, using feedback (reward) from their own actions and experiences. The agent essentially tries different actions in the environment and learns from the feedback it gets back. The goal is to find an action policy that maximizes the agent’s total cumulative reward.
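
To make this loop concrete, here is a minimal sketch of an agent interacting with an environment. The `Corridor` class, its `reset`/`step` methods and the reward of 1 at the goal are all hypothetical, invented only for illustration; they are not taken from any particular RL library.

```python
import random

class Corridor:
    """Hypothetical toy environment: states 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # feedback only when the goal is reached
        return self.state, reward, done

def run_episode(env, choose_action, max_steps=100):
    """Observe the state, act, receive reward and next state; repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

random.seed(0)
env = Corridor()
random_return = run_episode(env, lambda s: random.choice([0, 1]))  # trial and error
always_right_return = run_episode(env, lambda s: 1)                # a sensible fixed policy
```

Here `choose_action` plays the role of the agent. Always moving right reaches the goal in four steps, while the random agent may or may not get there within the step limit.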

What is a Policy?

The policy is simply a function that maps states to actions. It can be approximated with a neural network (with parameters θ), which is referred to as function approximation in traditional RL theory.

plot 2, Generated by author

The final goal in a reinforcement learning problem is to learn a policy π(a|s), which defines a distribution over actions conditioned on states, or equivalently to learn the parameters θ of this function approximation.
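
As a rough illustration of π(a|s) with parameters θ, the sketch below uses a table of logits in place of a neural network. This is a simplifying assumption made to keep the example dependency-free; the class name `TabularSoftmaxPolicy` and its methods are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TabularSoftmaxPolicy:
    """pi(a|s): one row of parameters (theta) per state, one logit per action."""

    def __init__(self, n_states, n_actions):
        self.theta = [[0.0] * n_actions for _ in range(n_states)]

    def probs(self, state):
        # The distribution over actions, conditioned on the state.
        return softmax(self.theta[state])

    def sample(self, state, rng=random):
        actions = range(len(self.theta[state]))
        return rng.choices(actions, weights=self.probs(state))[0]

policy = TabularSoftmaxPolicy(n_states=5, n_actions=2)
uniform = policy.probs(0)   # all-zero logits give equal probabilities
policy.theta[0][1] = 2.0    # changing theta changes the action distribution
skewed = policy.probs(0)
```

Training a policy then amounts to adjusting `theta` so that actions leading to higher reward become more probable.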

How the Policy is Trained

The process of reinforcement learning involves iteratively collecting data by interacting with the environment. This data is also referred to as experiences in RL theory. It is easy to appreciate why data is called experience if we understand the interaction of an agent with the environment.

Plot 3 *[1]

Traditionally, the agent observes the state of the environment (s), then takes an action (a) based on the policy π(a|s). The agent then receives a reward (r) and the next state (s’). The collection of these experiences (<s,a,r,s’>) is the data the agent uses to train the policy (parameters θ).
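
The experience tuples described above can be sketched as follows. The corridor dynamics in `env_step` are a hypothetical toy environment, and the extra `done` field is an assumption added here so that episode boundaries can be detected; the article itself only uses <s,a,r,s’>.

```python
import random
from collections import namedtuple

# One experience <s, a, r, s'>; `done` is an extra field assumed for convenience.
Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "done"])

N_STATES = 5  # hypothetical corridor: states 0..4, goal at state 4

def env_step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def collect_episode(policy, max_steps=50):
    """Roll out `policy` once and return the list of experiences."""
    experiences = []
    state = 0
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env_step(state, action)
        experiences.append(Transition(state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return experiences

random.seed(0)
data = collect_episode(lambda s: random.choice([0, 1]))
first = data[0]  # the first experience always starts in state 0
```

Every algorithm discussed in the rest of the article consumes data of this shape; what changes is which policy produced it and when.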

Where On-Policy RL, Off-Policy RL and Offline RL Fundamentally Differ

Now you understand what a policy is and how it is trained using data, which is a collection of experiences/interactions.

All these methods fundamentally differ in how this data (the collection of experiences) is generated.

Typically, experiences are collected using the latest learned policy, and that experience is then used to improve the policy. This is a form of online interaction: the agent interacts with the environment to collect the samples.

plot 4 *[2]

In on-policy reinforcement learning, the policy πk is updated with data collected by πk itself. We optimise the current policy πk and use it to determine which states and actions to explore and sample next. In other words, we improve the same policy that the agent is already using for action selection. The policy used for data generation is called the behaviour policy; the policy being learned is often called the target policy.

Behaviour policy == Target policy (the policy being improved)

Examples: Policy Iteration, SARSA, PPO, TRPO, etc.
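
As one possible illustration, here is a minimal tabular SARSA sketch. The corridor environment and all hyperparameters (`alpha`, `gamma`, `eps`, the episode count) are assumptions chosen for the example. The point to notice is that the next action used in the update is drawn from the same ε-greedy policy that is collecting the data.

```python
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical corridor, goal at state 4

def env_step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(q, state, eps, rng):
    """Mostly pick the best-looking action, sometimes explore at random."""
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])

def sarsa(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(q, state, eps, rng)
        for _step in range(200):
            next_state, reward, done = env_step(state, action)
            # On-policy: the next action comes from the SAME policy being improved.
            next_action = epsilon_greedy(q, next_state, eps, rng)
            target = reward if done else reward + gamma * q[next_state][next_action]
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
            if done:
                break
    return q

q_sarsa = sarsa()
greedy_actions = [max(range(N_ACTIONS), key=lambda a: q_sarsa[s][a])
                  for s in range(GOAL)]
```

Each transition is used once, immediately after it is collected, and then thrown away. That is the sample-efficiency cost on-policy methods pay.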

In the classic off-policy setting, the agent’s experience is appended to a data buffer (also called a replay buffer) D, and each new policy πk collects additional data, such that D is composed of samples from π0, π1, . . . , πk, and all of this data is used to train an updated new policy πk+1. The agent interacts with the environment to collect the samples.

plot 5 *[2]

Off-policy learning allows older samples (collected using older policies) to be reused in the update. To update the policy, experiences are sampled from a buffer that contains interactions collected by the current policy’s predecessors. This improves sample efficiency, since we don’t need to recollect samples whenever the policy changes.

Behaviour policy ≠ Target policy (the policy being improved)

Examples: Q-learning, DQN, DDQN, DDPG, etc.
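
A minimal sketch of this idea is tabular Q-learning with a replay buffer, shown below. The corridor environment, the buffer size, the batch size and the other hyperparameters are assumptions picked for illustration. Data is gathered by an ε-greedy behaviour policy, while the update target uses the greedy action (`max`), so old transitions from earlier policies remain usable.

```python
import random
from collections import deque

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical corridor, goal at state 4

def env_step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def q_learning_with_replay(episodes=300, alpha=0.5, gamma=0.9, eps=0.2,
                           batch_size=8, buffer_size=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    buffer = deque(maxlen=buffer_size)  # replay buffer D
    for _ in range(episodes):
        state = 0
        for _step in range(200):
            # Behaviour policy: epsilon-greedy on the current Q-table.
            if rng.random() < eps:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best])
            next_state, reward, done = env_step(state, action)
            buffer.append((state, action, reward, next_state, done))
            # Update from a random mini-batch mixing old and new experience.
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            for s, a, r, s2, d in batch:
                # Off-policy target: greedy value of s2, whatever was done next.
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            state = next_state
            if done:
                break
    return q, buffer

q_replay, replay_buffer = q_learning_with_replay()
greedy_actions = [max(range(N_ACTIONS), key=lambda a: q_replay[s][a])
                  for s in range(GOAL)]
```

Compared with an on-policy method, the same transition can be replayed many times, which is where the sample-efficiency gain comes from.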

Offline reinforcement learning algorithms utilize previously collected data, without any additional online data collection. The agent no longer has the ability to interact with the environment or to collect additional transitions using a behaviour policy. The learning algorithm is provided with a static dataset of fixed interactions, D, and must learn the best policy it can from this dataset alone.

plot 6 *[2]

This formulation more closely resembles the standard supervised learning problem statement, and we can regard D as the training set for the policy. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines.

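No online behaviour policy: D was logged beforehand by some (possibly unknown) behaviour policy, and the agent cannot collect new data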

Examples: Batch Reinforcement Learning, BCRL
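
To contrast with the two online settings, here is a minimal sketch of learning from a static dataset. It assumes a hypothetical corridor environment and a uniformly random logging policy, and it applies plain Q-learning updates to the fixed data. This is a simplification for illustration, not one of the dedicated offline algorithms named above.

```python
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical corridor, goal at state 4

def env_step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def collect_static_dataset(episodes=200, seed=0):
    """Stand-in for data logged beforehand (here by a uniformly random policy)."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(episodes):
        state = 0
        for _step in range(50):
            action = rng.randrange(N_ACTIONS)
            next_state, reward, done = env_step(state, action)
            dataset.append((state, action, reward, next_state, done))
            state = next_state
            if done:
                break
    return dataset

def offline_q_learning(dataset, sweeps=50, alpha=0.5, gamma=0.9):
    """Learns only from `dataset`; the environment is never called here."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s, a, r, s2, d in dataset:
            target = r if d else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
    return q

D = collect_static_dataset()   # fixed once, never extended
q_offline = offline_q_learning(D)
greedy_actions = [max(range(N_ACTIONS), key=lambda a: q_offline[s][a])
                  for s in range(GOAL)]
```

This naive approach works here only because the random dataset happens to cover every state-action pair of the tiny corridor. When the data covers the state-action space poorly, the learner cannot try out the actions it is unsure about.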


The theoretical differences between these techniques are clearly stated, but their drawbacks and strengths are more complex, so we will save them for the next blog in this series. An RL practitioner must understand the computational complexity, pros and cons of each method to evaluate its appropriateness for the problem they are solving.


[2] Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems