
Reinforcement Learning

Updated: Jan 18, 2022

Reinforcement learning (RL) is about learning from interaction with an environment and improving over time. In RL, the agent starts without prior knowledge of the system. It gathers feedback from its actions and uses that feedback to learn which actions maximize a specific objective.

Fig 1. Autonomous Car

Let's consider the example of an autonomous car. The car is the agent, and driving is the agent's behaviour. The current state (St) of the agent consists of its current geo-coordinates, speed, direction, visuals from cameras, LiDAR (Light Detection And Ranging) and other sensor readings. The actions (At) the agent can take are braking, accelerating and steering the car. The policy maps the current state to an action. Each action the agent takes produces an outcome that is fed back to the agent, called the reward (Rt+1). After taking the action, the agent is in a new state (St+1) with new coordinates, speed, direction, etc.

Environment is the surroundings the agent interacts with.

Model describes the rules of the game or the physics of the world. It gives the probability that the agent ends up transitioning to St+1 given that it is in state St and took action At.
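As a rough illustration of the model described above, the transition probabilities can be kept in a lookup table keyed by (state, action). The state names, actions and probabilities below are made up for this sketch; they do not come from a real driving model.

```python
import random

# Hypothetical transition model: P(next_state | state, action).
# All names and numbers here are illustrative assumptions.
TRANSITIONS = {
    ("cruising", "accelerate"): {"speeding": 0.7, "cruising": 0.3},
    ("cruising", "brake"): {"stopped": 0.6, "cruising": 0.4},
    ("speeding", "brake"): {"cruising": 0.9, "speeding": 0.1},
}


def transition_prob(state, action, next_state):
    """Return P(next_state | state, action), or 0.0 if the move is not listed."""
    return TRANSITIONS.get((state, action), {}).get(next_state, 0.0)


def sample_next_state(state, action, rng=random):
    """Draw a next state according to the model's probabilities."""
    outcomes = TRANSITIONS[(state, action)]
    return rng.choices(list(outcomes), weights=list(outcomes.values()), k=1)[0]
```

For each (state, action) pair the probabilities over next states sum to 1, which is what makes the table a valid model.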

Reward is a scalar value that the agent gets for being in a state.

Policy tells the agent which action to take in any given state.

Optimal policy maximizes long-term expected reward. It keeps the agent from chasing short-term gains that lead to long-term losses.

Markov property states that, given the current state and action, the next state is independent of all preceding states and actions. A Markov Decision Process (MDP) models an agent interacting with such an environment. By repeating the interaction, the agent copes with changing situations and learns the optimal action.
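In symbols, using the same state (S), action (A) and time-step (t) notation as the car example, the Markov property can be written as:

```latex
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
```

The left side only looks at the current state and action, while the right side conditions on the full history; the property says the two agree.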

Fig 2. Agent Environment Interaction Cycle

The agent at time t is in state St. From the set of actions available in this state, it takes a specific action At. At this point, the system transitions to the next time step (t+1). The environment responds with a numerical reward Rt+1 and puts the agent in a new state St+1. This cycle of "state to action to reward & next state" repeats until the agent reaches some end state, such as the end of a game, the completion of a task, or the end of a specified number of time steps.
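The cycle above can be sketched as a short loop. This is a minimal sketch, not a real driving simulator: `CountdownEnv`, its `reset` and `step` methods, the rewards and the fixed `policy` are all hypothetical names and values chosen for illustration.

```python
class CountdownEnv:
    """Toy environment (an assumption for illustration): the state is an
    integer, and the episode ends when it reaches 0."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # action is "down" (move toward 0) or anything else (stay put)
        if action == "down":
            self.state -= 1
        done = self.state == 0
        reward = 1.0 if done else -0.1  # R_{t+1}
        return self.state, reward, done


def policy(state):
    """Fixed policy: maps every state to the same action."""
    return "down"


def run_episode(env, policy, max_steps=100):
    """Run the state -> action -> reward & next state cycle until the end."""
    state = env.reset()  # S_0
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        action = policy(state)                  # A_t
        state, reward, done = env.step(action)  # S_{t+1}, R_{t+1}
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

Calling `run_episode(CountdownEnv(3), policy)` walks the state from 3 down to 0 in three steps and then stops, because the environment reports that the end state was reached.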

Positive Reinforcement Learning:

Someone clapping when you finally manage to ride a cycle is a positive reward; this is called positive reinforcement learning.

Negative Reinforcement Learning:

Falling down and getting hurt while riding a cycle is a negative reward; this is called negative reinforcement learning.

Importance of Rewards:

Reinforcement learning maximizes the total reward collected by taking actions over time. Actions lead to rewards that may arrive immediately or only after a few steps. The agent learns by trial and error, repeating attempts and observing which actions give a higher reward.

Consider the driving zone shown in figure 3.

Fig 3. Driving Zone

Here we have 12 states arranged in a 3 x 4 grid. The actions that can be taken are FORWARD, REVERSE, LEFT and RIGHT. The reward for reaching the goal is +1 and the reward for failure is -1.
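A minimal sketch of this driving zone in code is shown below. The text does not state where the goal and failure cells sit, so their positions are assumptions: cells are written as (row, column) starting from 1, with the goal at (1, 4) and the failure at (2, 4). That choice is consistent with the article's description of moving forward from (3, 3) to (2, 3), which also implies that FORWARD decreases the row number. Moves are treated as deterministic here, which is a simplification.

```python
ROWS, COLS = 3, 4

# Each action is a (row change, column change) pair.
ACTIONS = {
    "FORWARD": (-1, 0),  # toward row 1
    "REVERSE": (1, 0),
    "LEFT": (0, -1),
    "RIGHT": (0, 1),
}

# Assumed positions of the terminal cells (not given in the text).
GOAL = (1, 4)
FAILURE = (2, 4)
TERMINAL_REWARDS = {GOAL: 1.0, FAILURE: -1.0}

# All 12 states of the 3 x 4 grid.
STATES = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]


def move(state, action):
    """Deterministic move; the car stays put if it would leave the grid."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 1 <= r <= ROWS and 1 <= c <= COLS:
        return (r, c)
    return state
```

For example, `move((3, 3), "FORWARD")` returns `(2, 3)`, and `move((1, 1), "LEFT")` returns `(1, 1)` because the car cannot leave the grid.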

Case 1: Higher Positive Reward

Consider a reward of +5 for each move. Because each move pays more than the goal, the car/agent keeps accumulating move rewards and avoids both the goal and failure states.

Fig 4. Movement of Car when R(S) = +5

Case 2: Higher Negative Reward

Consider a reward of -5 for each move. Because each move costs far more than the goal pays, the car tries to end the game as quickly as possible. It takes the shortest path to an end state, regardless of whether that is the goal or the failure state.

Fig 5. Movement of Car when R(S) = -5

Case 3: Lower Negative Reward

Consider a reward of -0.05 for each move. The small negative reward encourages the car to end the game while still avoiding the state that gives -1.

Fig 6. Movement of Car when R(S) = -0.05

In the above scenario, the agent lives forever and has an infinite time horizon to work with. From position (3, 3), the car avoids going forward to position (2, 3) because there is a chance of ending up in the -1 failure state. It chooses to go the long way round because the negative reward per move is small compared to the positive reward (+1) waiting at the goal. This makes sense only if the car can run forever and can afford to take the long route.
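The arithmetic behind the three cases can be checked with a small helper. This is a rough sketch under stated assumptions: moves are deterministic, rewards are simply added up without discounting, every move (including the last one) earns the per-move reward, and the path lengths (1, 5, 7 and 10 moves) are made-up numbers for illustration rather than values read from the figures.

```python
def path_return(step_reward, n_moves, terminal_reward=0.0):
    """Undiscounted total: one step reward per move plus the terminal reward."""
    return step_reward * n_moves + terminal_reward


# Case 1: +5 per move. Wandering for 10 moves beats reaching the goal in 5.
wander = path_return(5, 10)
to_goal = path_return(5, 5, 1.0)

# Case 2: -5 per move. Crashing after 1 move beats a 5-move drive to the goal.
fail_fast = path_return(-5, 1, -1.0)
goal_slow = path_return(-5, 5, 1.0)

# Case 3: -0.05 per move. A 2-move detour costs only about 0.1, so both
# routes to the goal stay far above the -1 failure.
short_route = path_return(-0.05, 5, 1.0)
long_route = path_return(-0.05, 7, 1.0)

print(wander > to_goal, fail_fast > goal_slow, short_route - long_route)
```

In case 3 the detour is cheap compared with the risk of landing on -1, which is why the long way round can be the better choice.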

What if the car only has 3 moves left? How do we push the agent to not focus just on collecting reward but to do so in the fastest possible time? This question leads to the concept of discounting.


By introducing a discount, we can add up an infinite sequence of rewards and still get a finite total. The discount factor is similar to the idea in finance that money now is more valuable than money later.

Gt = Rt+1 + Ɣ Rt+2 + Ɣ^2 Rt+3 + ...

Gt - return

Ɣ - discount factor, 0 ≤ Ɣ ≤ 1 (the infinite sum is guaranteed to be finite only when Ɣ < 1)

R - reward
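To see why the discounted sum stays finite, assume every reward is bounded in magnitude by some constant R_max (a symbol introduced here only for illustration) and that the discount factor Ɣ, written \gamma in the LaTeX below, satisfies 0 ≤ Ɣ < 1. The geometric series then gives:

```latex
|G_t| = \left| \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \right|
\le \sum_{k=0}^{\infty} \gamma^{k} R_{\max}
= \frac{R_{\max}}{1 - \gamma}
```

For example, with rewards no larger than 1 in magnitude and a discount factor of 0.9, the return can never exceed 10 in magnitude.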

When Ɣ = 0,

Gt = Rt+1

This condition is called short-sighted, as the agent is greedy with respect to the next-step reward.

When Ɣ = 1,

Gt = Rt+1 + Rt+2 + Rt+3 + ...

This condition is called far-sighted, as the agent gives equal importance to all future rewards.
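The effect of the discount factor can be tried out with a few lines of code. This is a minimal sketch for a finite episode; the function name `discounted_return` and the reward list (three moves at -0.05 followed by the +1 goal) are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` lists R_{t+1}, R_{t+2}, ... for a finite episode.
    """
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [-0.05, -0.05, -0.05, 1.0]  # illustrative episode

print(discounted_return(rewards, 0.0))  # short-sighted: only R_{t+1} counts
print(discounted_return(rewards, 0.9))  # later rewards count, but less
print(discounted_return(rewards, 1.0))  # far-sighted: plain sum of rewards
```

With a discount factor of 0 the result is just the first reward (-0.05), while with a discount factor of 1 the +1 at the end counts in full.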

Thus, it is important to design a reward that matches the behaviour we want from the agent.

Hope you got some basics of reinforcement learning :)

