In the last post, we explored the main concepts of the Q-Learning algorithm through a simple and comprehensive example. Now, let’s dive deeper and learn about the exploration rate and action-selection policies. But first, here’s a quick recap:
Q-Learning is a reinforcement learning algorithm, which means it learns through a series of rewards and penalties over time. It’s considered a model-free algorithm because it learns solely by interacting with the environment, without prior knowledge of the reward function. The agent starts in an initial state and can choose from a set of possible actions—each leading to a different outcome and associated with a reward (or penalty). These values are used to calculate the Q-value, based on the Bellman equation. After each action, the Q-value is stored in the Q-table, and with each new episode, these values are updated to reflect what the agent has learned.
To calculate the Q-value, three key parameters must be defined: the learning rate (α), which determines how much new information overrides existing Q-values; the discount factor (γ), which controls the importance of future rewards—values closer to 1 give more weight to long-term gains; and the exploration rate (ε), which governs how often the agent explores new actions instead of consistently choosing the one with the highest known Q-value.
A greedy algorithm is a strategy that always selects the best available option at each step, without considering long-term consequences. In Q-Learning, the algorithm is considered greedy when the exploration rate is set to zero—meaning the agent always chooses the action with the highest known Q-value. While this might seem efficient, it comes with a major drawback: the agent can get stuck following a suboptimal path. Introducing an exploration rate allows the agent to occasionally try new actions, enabling it to gather more information about the environment and potentially discover better strategies.
When the exploration rate is set high (close to 1), the agent primarily explores the environment; when it is low (e.g., 0.01), the agent mainly exploits—choosing the best known action. A common approach to balance exploration and exploitation is to gradually decay the exploration rate over time. This way, the agent starts by actively gathering information about the environment and, as learning progresses, shifts toward making safer decisions based on the knowledge it has accumulated—moving closer to optimal reward values. Here are the most common ε-decay strategies (a short code sketch comparing them follows the list):
1. Exponential decay
ε_{t+1} = ε_t ∙ decay_rate
which leads to ε_t = ε_0 ∙ (decay_rate)^t
When the decay rate is too low (e.g., 0.5), the agent might not explore enough; but if it’s too close to 1, the agent may spend too much time exploring and delay convergence. The exponential decay approach is one of the most commonly used because it strikes a practical balance: it prevents the agent from getting stuck too early in suboptimal paths while also avoiding endless exploration. It’s easy to understand and implement, and it allows for a smooth transition from exploration to exploitation over time.
2. Linear decay
ε = ε_start – ((ε_start – ε_min) / N) ∙ episode
where N is the number of episodes over which ε decays and episode is the current episode number. It is important to clamp ε at ε_min, or it will become negative.
3. Step decay
This is a simple yet less smooth approach, where different values of the exploration rate are manually defined for specific episode intervals. For example:
| Episode range | ε (exploration rate) |
| 0-99 | 1.0 |
| 100-199 | 0.5 |
| 200 + | 0.1 |
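To make the three schedules concrete, here is a minimal sketch in Python; the numbers used here (decay_rate = 0.98, N = 200, 300 episodes) are illustrative assumptions, not the values used in the example later in this post:

epsilon_start, epsilon_min = 1.0, 0.1

def epsilon_exponential(episode, decay_rate=0.98):
    # ε_t = ε_0 * decay_rate^t, clamped at ε_min
    return max(epsilon_min, epsilon_start * decay_rate ** episode)

def epsilon_linear(episode, N=200):
    # ε decreases in a straight line from ε_start to ε_min over N episodes
    return max(epsilon_min, epsilon_start - (epsilon_start - epsilon_min) / N * episode)

def epsilon_step(episode):
    # manually defined values per episode interval, as in the table above
    if episode < 100:
        return 1.0
    elif episode < 200:
        return 0.5
    return 0.1

for ep in (0, 50, 100, 200, 299):
    print(ep, round(epsilon_exponential(ep), 3), round(epsilon_linear(ep), 3), epsilon_step(ep))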
What is a policy?
In reinforcement learning, a policy is the agent’s decision-making strategy. It defines what action the agent should take when it’s in a particular state.
Mathematically, a policy is represented as:
π: S → A
This means a policy π is a function that maps states (S) to actions (A). It governs the agent’s behavior over time and improves as the Q-values become more accurate.
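As a concrete illustration, a learned Q-table induces a policy by selecting, in each state, the action with the highest Q-value. Here is a minimal sketch, where the states, actions and Q-values are made up purely for illustration:

# Hypothetical Q-values stored as Q[state][action] (illustrative numbers)
Q = {
    'A': {'left': 0.2, 'right': 1.4},
    'B': {'left': 0.9, 'right': 0.1},
}

# π: S -> A, realized here as a dictionary mapping each state to its best action
policy = {state: max(actions, key=actions.get) for state, actions in Q.items()}
print(policy)  # {'A': 'right', 'B': 'left'}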
We’ve already discussed two types of policies:
- Greedy policy: The agent always selects the action with the highest Q-value—typically used during the exploitation phase.
- ε-greedy policy: The agent chooses a random action with probability ε, and the best known action with probability 1 – ε—commonly used during training to encourage exploration.
There are also other types of policies:
| Policy Type | Description | Used when |
| Softmax | Chooses actions probabilistically, based on Q-values | When smoother exploration is needed |
| Stochastic | Outputs a probability distribution over actions | Useful in policy gradient methods |
| Deterministic | Always selects the same action for a given state | Common in Q-learning |
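As an example of the softmax policy from the table above, the sketch below converts Q-values into selection probabilities; the temperature parameter tau is an assumption here (lower values make the choice closer to greedy, higher values make it closer to uniform):

import numpy as np

def softmax_action(q_values, actions, tau=1.0):
    # Turn Q-values into a probability distribution over the available actions
    q = np.array(q_values, dtype=float)
    preferences = np.exp((q - q.max()) / tau)  # subtract the max for numerical stability
    probabilities = preferences / preferences.sum()
    return np.random.choice(actions, p=probabilities)

# Usage: actions with higher Q-values are chosen more often, but not always
print(softmax_action([2.0, 5.0, 0.0], ['a1', 'a2', 'a3']))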
Now, let’s use a very simple example to better understand the ε-greedy policy.
Consider the following environment, where the agent starts at S1, receives a reward of 4 points for reaching S4, and a reward of 10 points for reaching S6 (the final state):

Let’s consider a learning rate of 0.5, a discount factor of 0.8 and an exploration rate of 0.9, with an exponential decay rate of 0.9.
Looking at the environment, it’s easy to see that the only actions that can produce nonzero Q-values in the first episode are S1 → S4, S3 → S6 and S5 → S6. Below are the calculations for the steps taken in the first episode (S1 → S4 → S5 → S6) and the resulting Q-table (remember that all initial Q-values are zero):
Step 1: S1 → S4
Reward (r): 4
Q(S1, S4) = Q(S1, S4) + α ∙(r + γ∙max(Q(S4, ∙)) – Q(S1, S4))
Q(S1, S4) = 0 + 0.5 ∙ ((4 + 0.8 ∙ 0) – 0) = 0.5 ∙ 4 = 2.0
(where max(Q(S4, ∙)) is the maximum Q-value over all possible actions the agent can take from state 4)
Step 2: S4 → S5
Reward (r): 0
Q(S4, S5) = Q(S4, S5) + α ∙(r + γ∙max(Q(S5, ∙)) – Q(S4, S5))
Q(S4, S5) = 0 + 0.5 ∙(0 + 0.8 ∙0 – 0) = 0
Step 3: S5 → S6
Reward (r): 10 (terminal)
Q(S5,S6) = 0 + 0.5 ∙ (10 + 0.8∙0 – 0) = 5.0
The Q-table at the end of the first episode is:
| State | Action (next state) | Q-value |
| S1 | S2 | 0.0 |
| S1 | S4 | 2.0 |
| S2 | S1 | 0.0 |
| S2 | S3 | 0.0 |
| S2 | S5 | 0.0 |
| S3 | S2 | 0.0 |
| S3 | S6 | 0.0 |
| S4 | S1 | 0.0 |
| S4 | S5 | 0.0 |
| S5 | S2 | 0.0 |
| S5 | S4 | 0.0 |
| S5 | S6 | 5.0 |
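If you want to check these numbers, here is a minimal sketch of the update rule in isolation (the helper name q_update is just for illustration); it reproduces the three hand-computed values with α = 0.5 and γ = 0.8:

alpha, gamma = 0.5, 0.8  # learning rate and discount factor from the example

def q_update(current_q, reward, max_next_q):
    # Q(s, a) <- Q(s, a) + α * (r + γ * max Q(s', ·) - Q(s, a))
    return current_q + alpha * (reward + gamma * max_next_q - current_q)

print(q_update(0.0, 4, 0.0))   # Q(S1, S4) -> 2.0
print(q_update(0.0, 0, 0.0))   # Q(S4, S5) -> 0.0
print(q_update(0.0, 10, 0.0))  # Q(S5, S6) -> 5.0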
In Q-learning, the exploration rate (commonly denoted as ε in an ε-greedy strategy) does not appear directly in the Q-value update formula itself. Instead, it plays a crucial role in the action-selection strategy, determining whether the agent explores or exploits at each step. For the next episode, this rate is updated as follows:
ε_1 = ε_0 ∙ decay_rate = 0.9 ∙ 0.9 = 0.81
An exploration rate of 0.81 means that your agent will choose a random action (explore) 81% of the time, and will choose the best-known action (exploit) based on the current Q-values only 19% of the time.
Now, let’s implement this algorithm in Python!
import numpy as np
import pandas as pd
import random
# Define the environment
states = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']
terminal_state = 'S6'
# Define transitions with rewards
transitions = {
    'S1': {'S2': 0, 'S4': 4},
    'S2': {'S1': 0, 'S3': 0, 'S5': 0},
    'S3': {'S2': 0, 'S6': 10},
    'S4': {'S1': 0, 'S5': 0},
    'S5': {'S2': 0, 'S4': 0, 'S6': 10},
    'S6': {}  # terminal state: no outgoing actions
}
q_table = pd.DataFrame(0.0, index=states, columns=states)
alpha = 0.5          # learning rate (α), as in the worked example above
gamma = 0.8          # discount factor (γ)
epsilon_start = 0.9  # initial exploration rate (ε)
epsilon_min = 0.1    # lower bound for ε
decay_rate = 0.9     # exponential decay rate for ε
epsilon_values = []  # ε value used in each episode
episodes = 20
max_steps = 4        # step limit per episode
# Training loop
for episode in range(episodes):
    # The agent always starts a new episode in S1, and the exploration
    # rate is updated with exponential decay (clamped at epsilon_min)
    state = 'S1'
    steps = 0
    epsilon = max(epsilon_min, epsilon_start * (decay_rate ** episode))
    epsilon_values.append(epsilon)

    # The episode runs until the terminal state is reached
    # or the step limit is exceeded
    while state != terminal_state and steps < max_steps:
        actions = transitions[state]
        if not actions:
            break  # no outgoing actions: terminal state

        # ε-greedy action selection: if a random number is lower than the
        # exploration rate, the next state is selected randomly (explore);
        # otherwise the action with the highest Q-value is chosen (exploit)
        if np.random.rand() < epsilon:
            next_state = random.choice(list(actions.keys()))
        else:
            next_state = q_table.loc[state, list(actions.keys())].idxmax()

        # Q-value update (Bellman equation)
        reward = transitions[state][next_state]
        future_q = q_table.loc[next_state].max() if next_state != terminal_state else 0
        current_q = q_table.loc[state, next_state]
        q_table.loc[state, next_state] = current_q + alpha * (reward + gamma * future_q - current_q)

        state = next_state
        steps += 1
# Final Q-table
print("Final Q-table after", episodes, "episodes:\n")
print(q_table.round(2))
# Finding the shortest path: follow the highest Q-value greedily from S1
state = 'S1'
path = [state]
for _ in range(max_steps):
    actions = transitions[state]
    if not actions:
        break  # terminal state reached
    # Always select the action with the highest learned Q-value
    next_state = q_table.loc[state, list(actions.keys())].idxmax()
    path.append(next_state)
    state = next_state

print("Shortest path:", ' -> '.join(path))
By adjusting learning parameters like the exploration rate and monitoring how the Q-values evolve, we can guide the agent toward increasingly optimal behavior. While early episodes may involve inefficient loops or detours, with enough training and proper tuning, the algorithm can converge to a policy that consistently leads to the highest rewards.
