Reinforcement Learning for Optimal Station Policy

Kowshik chilamkurthy
4 min read · Oct 24, 2022

Problem Statement

  1. Suppose you are in a square playground and a ball pops up at a random location (uniformly distributed) in the playground. Your job is to collect the ball and wait for the next ball to pop up.
  2. The next ball pops up only after you collect the one on the ground, and then only after a fixed waiting time.
  3. The ball-collecting game runs until time runs out.
  4. The more balls you collect, the greater your reward at the end of the game.

Problem: what should you do during the waiting time so that you can quickly collect the next ball, which can pop up anywhere in the playground? Assume that the waiting time is long enough to circle the square playground once.

Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning in which an agent learns how to choose actions from its action space by interacting with an environment, in order to maximize rewards over time. Complex enough? Let’s break this definition down for better understanding.

Agent: The program you train, with the aim of doing a job you specify.
Environment: The world in which the agent performs actions.
Action: A move made by the agent, which causes a change in the environment.
Rewards: The evaluation of an action, which is like feedback.

In any RL modelling task, it’s imperative to define these 4 essential elements. Before we define them for our ball-collecting problem, let’s first look at an example of how an agent learns actions in an environment.

Agent: Program controlling the movement of limbs
Environment: The real world simulating gravity and laws of motion
Action: Move limb L with Θ degrees
Reward: Positive when it approaches destination; negative when it falls down.

Agents learn in an interactive environment by trial and error, using feedback (rewards) from their own actions and experiences. An agent essentially tries different actions on the environment and learns from the feedback it gets back. The goal is to find an action policy that maximizes the agent’s total cumulative reward.
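
To make the trial-and-error loop concrete, here is a minimal sketch of one episode of agent and environment interaction. The `ToyEnv` class, its one-dimensional layout, and the reward values are hypothetical and chosen only for illustration; they are not the environment used in this article.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: start at position 0, goal at position 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clipped to [0, 3]
        self.pos = max(0, min(3, self.pos + action))
        done = self.pos == 3
        # Feedback for the action: +1.0 at the goal, a small cost otherwise
        reward = 1.0 if done else -0.1
        return self.pos, reward, done


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the cumulative reward collected."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total += reward                         # accumulate the feedback
        if done:
            break
    return total


def random_policy(state):
    """Pure exploration: pick a direction at random."""
    return random.choice([-1, 1])


def always_right(state):
    """A policy that happens to be optimal for this toy environment."""
    return 1


print(run_episode(ToyEnv(), random_policy))
print(run_episode(ToyEnv(), always_right))
```

A learning algorithm would sit between these two extremes, gradually shifting from random exploration toward the actions that produced higher cumulative reward.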

Optimal Station Policy Problem

Now let’s define these 4 essential elements for our optimal station policy problem:
Agent: A program controlling the movement of the runner through different actions.
Environment: The virtual grid world where the runner can move around and pick up the target. The grid world can be easily created using a NumPy array.
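
As a rough sketch of how such a grid world could be held in a NumPy array, consider the snippet below. The grid size, the cell codes, and the helper names `make_grid` and `spawn_target` are assumptions made for illustration, not the article's actual implementation.

```python
import numpy as np

GRID_SIZE = 10                   # assumed side length of the square playground
EMPTY, RUNNER, TARGET = 0, 1, 2  # assumed cell codes


def make_grid(runner_pos, target_pos, size=GRID_SIZE):
    """Return a size x size array marking the runner and the target."""
    grid = np.full((size, size), EMPTY, dtype=np.int8)
    grid[target_pos] = TARGET
    grid[runner_pos] = RUNNER  # the runner overwrites the target if they overlap
    return grid


def spawn_target(rng, size=GRID_SIZE):
    """Sample a target cell uniformly at random, as in the problem statement."""
    row, col = rng.integers(0, size, size=2)
    return int(row), int(col)


rng = np.random.default_rng(seed=0)
runner = (GRID_SIZE // 2, GRID_SIZE // 2)
target = spawn_target(rng)
grid = make_grid(runner, target)
print(grid)
```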

Actions: There are 5 possible actions: Up, Left, Down, Right, and No move. Based on the actions chosen by the algorithm, the runner moves around the grid.
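
One possible way to encode the five actions is as (row, column) offsets, as sketched below. The action indices, the default grid size, and the choice to keep the runner in place when a move would leave the grid are all assumptions for illustration.

```python
# Assumed action indices mapped to (row, column) offsets
ACTIONS = {
    0: (-1, 0),  # Up
    1: (0, -1),  # Left
    2: (1, 0),   # Down
    3: (0, 1),   # Right
    4: (0, 0),   # No move
}


def move(pos, action, size=10):
    """Apply an action to a (row, col) position, clipping at the grid edges."""
    d_row, d_col = ACTIONS[action]
    row = min(max(pos[0] + d_row, 0), size - 1)
    col = min(max(pos[1] + d_col, 0), size - 1)
    return (row, col)


print(move((5, 5), 3))  # one cell to the right
print(move((0, 0), 0))  # blocked by the top edge, so the runner stays put
```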


Rewards must be designed carefully. The definition of the reward is important because it greatly impacts the behaviour of the agent.

Reward (+): Pick Target
The agent must be rewarded if the runner picks up the target.

Penalty (-): Cosine Distance Scale
The distance between the runner and the target must be minimised, so the cosine distance between the runner and the target is penalised.

Penalty (-): Single Step
The runner must minimize its total movement in order to find the optimal station policy, so every step it takes is penalised.
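
The three reward terms above could be combined roughly as follows. This is only a sketch: the constants, the function names, and the decision to skip the penalties on a successful pick are assumptions, not values from the article. Cosine distance is computed here as one minus the cosine similarity of the two position vectors measured from the grid origin, and it is treated as zero when either position is the origin cell, where it is undefined.

```python
import numpy as np

PICK_REWARD = 10.0    # assumed reward for picking up the target
DISTANCE_SCALE = 1.0  # assumed weight of the cosine distance penalty
STEP_PENALTY = 0.1    # assumed cost of a single step


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two position vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        return 0.0  # assumption: undefined at the origin cell, treated as zero
    return 1.0 - float(np.dot(a, b) / norm)


def reward(runner_pos, target_pos, moved):
    """Combine the pick reward, the distance penalty, and the step penalty."""
    if runner_pos == target_pos:
        return PICK_REWARD  # Reward (+): Pick Target
    r = -DISTANCE_SCALE * cosine_distance(runner_pos, target_pos)
    if moved:
        r -= STEP_PENALTY   # Penalty (-): Single Step
    return r


print(reward((2, 3), (2, 3), moved=True))
print(reward((1, 0), (0, 1), moved=True))
```

Because "No move" carries no step penalty in this sketch, the agent has an incentive to stand still once it has reached a good waiting position.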

Terminal State

The episode must end if the agent is no longer able to pick up targets. When the agent is exploring at the beginning of training, the environment must reset if the agent is not yet picking up targets. This is very important because the agent must learn from its mistakes and start new episodes with what it has already learnt in previous episodes.
We must define a terminal state such that the environment resets when the agent reaches it.

Total Game Time
The total length of the game. If the agent reaches the maximum number of time steps allowed, the environment is reset.

Maximum Allowed Time to Pick
If the agent is not able to pick up the target, we track the time it takes, and if that time exceeds the Maximum Allowed Time to Pick, the environment is reset.
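
The two reset conditions can be sketched as a single check that the environment runs after every step. The limits and the counter names below are hypothetical; the article does not state the actual values.

```python
MAX_GAME_STEPS = 200  # assumed Total Game Time, in time steps
MAX_PICK_STEPS = 40   # assumed Maximum Allowed Time to Pick, in time steps


def is_terminal(total_steps, steps_since_last_pick):
    """Return True when the episode should end and the environment reset."""
    if total_steps >= MAX_GAME_STEPS:
        return True  # the game has run for its full length
    if steps_since_last_pick > MAX_PICK_STEPS:
        return True  # the agent took too long to pick up the current target
    return False


print(is_terminal(total_steps=50, steps_since_last_pick=10))
print(is_terminal(total_steps=50, steps_since_last_pick=41))
print(is_terminal(total_steps=200, steps_since_last_pick=0))
```

In this sketch the `steps_since_last_pick` counter would be reset to zero each time the runner picks up a target, while `total_steps` keeps counting for the whole episode.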