Tabular Q-Learning · Temporal-Difference Control · ε-Greedy Exploration
A clean, dependency-minimal implementation of the Q-Learning algorithm applied to a deterministic grid navigation problem — inspired by OpenAI Gym's FrozenLake-v1 environment. Built to demonstrate core reinforcement learning principles without the abstraction layers of deep learning frameworks.
The agent operates in a 4×4 deterministic grid with the following topology:
| Cell | Type | Reward |
|---|---|---|
| (0,0) | Start (S) | — |
| (1,3), (2,1), (3,1) | Hole (✕) | −0.5 |
| (3,3) | Goal (T) | +0.5 |
| All others | Normal | −0.01 |
- State space: 16 cells (12 non-terminal — the Goal and the three Holes are absorbing)
- Action space: `{U, D, L, R}` — bounded by grid walls (invalid moves are excluded per state)
- Transition dynamics: Deterministic — the agent moves exactly in the chosen direction
- Episode termination: Reaching the Goal cell `(3,3)`, falling into a Hole, or exceeding 20 steps
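These dynamics are small enough to capture in a few lines. Below is a minimal Python sketch consistent with the table above — note that it clamps moves at the walls, whereas the repo's `GridWorld` excludes invalid actions from each state's action set, and its exact attribute layout may differ:

```python
class GridWorld:
    """Minimal 4x4 grid matching the layout above (a sketch, not the repo's exact class)."""

    def __init__(self, size=4):
        self.size = size
        self.goal = (3, 3)
        self.holes = {(1, 3), (2, 1), (3, 1)}
        # Reward for *entering* a cell; every other cell costs the -0.01 step penalty.
        self.rewards = {self.goal: 0.5, **{h: -0.5 for h in self.holes}}

    def step(self, state, action):
        """Deterministic transition; returns (next_state, reward, done)."""
        moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
        dr, dc = moves[action]
        nr = min(max(state[0] + dr, 0), self.size - 1)  # clamp at the walls
        nc = min(max(state[1] + dc, 0), self.size - 1)
        nxt = (nr, nc)
        reward = self.rewards.get(nxt, -0.01)
        done = nxt == self.goal or nxt in self.holes
        return nxt, reward, done
```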
The small per-step penalty of −0.01 incentivises the agent to find the shortest path to the goal rather than wandering.
This implementation uses the classical Q-Learning update rule (Watkins, 1989) — an off-policy temporal-difference method that directly estimates the optimal action-value function $Q^*(s, a)$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
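To make the update concrete, here is one worked step with illustrative numbers (not taken from the repo): suppose $Q(s, a) = 0$, the move earns $r = -0.01$, and the best next-state estimate is $\max_{a'} Q(s', a') = 0.3$. With $\alpha = 0.1$ and $\gamma = 0.9$:

$$Q(s, a) \leftarrow 0 + 0.1 \left[ -0.01 + 0.9 \times 0.3 - 0 \right] = 0.026$$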
| Symbol | Parameter | Value |
|---|---|---|
| α | Learning rate | 0.1 |
| γ | Discount factor | 0.9 |
| ε | Exploration rate (ε-greedy) | 0.01 |
| — | Episodes | 2,001 |
| — | Max steps / episode | 20 |
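In the script these live as plain variables; `alpha`, `gamma`, and `exploreRate` are the names given in the customization table later in this README, while the episode and step-limit names here are assumptions:

```python
alpha = 0.1         # learning rate (α)
gamma = 0.9         # discount factor (γ)
exploreRate = 0.01  # ε for ε-greedy action selection
episodes = 2001     # number of training episodes (assumed name)
max_steps = 20      # step limit per episode (assumed name)
```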
```
Initialize Q(s, a) = 0 for all state-action pairs
Initialize π randomly
for episode = 1 to 2001:
    s ← (0, 0)                                   # start state
    while s is not terminal and steps < 20:
        a ← ε-greedy action from π(s)
        s', r ← environment.step(s, a)
        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
        s ← s'
    Update π(s) ← argmax_a Q(s,a) for all s
```
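The pseudocode maps onto a compact runnable loop. Below is a minimal NumPy sketch reusing the `GridWorld` sketch from the environment section; unlike the pseudocode it acts ε-greedily on the Q-table directly instead of materializing π, which is equivalent for control purposes. It is an illustration, not the repo's exact script:

```python
import numpy as np

def train(env, alpha=0.1, gamma=0.9, epsilon=0.01, episodes=2001, max_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    actions = ["U", "D", "L", "R"]
    # Tabular Q: one value per (cell, action), initialized to zero
    Q = {(r, c): np.zeros(len(actions)) for r in range(env.size) for c in range(env.size)}
    for _ in range(episodes):
        s, done, steps = (0, 0), False, 0
        while not done and steps < max_steps:
            if rng.random() < epsilon:             # explore
                a = int(rng.integers(len(actions)))
            else:                                  # exploit
                a = int(np.argmax(Q[s]))
            s_next, reward, done = env.step(s, actions[a])
            # Off-policy TD update: bootstrap from the greedy next action.
            # Terminal cells are never updated, so their max stays 0 — no bootstrap past the end.
            Q[s][a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s][a])
            s, steps = s_next, steps + 1
    return Q

Q = train(GridWorld())
```

Greedily following `np.argmax(Q[s])` from `(0, 0)` should then trace the 6-step route discussed under the results.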
Key property: Q-Learning is off-policy — the update uses $\max_{a'} Q(s', a')$ regardless of the action actually taken. This guarantees convergence to $Q^*$ under standard conditions (all state–action pairs visited infinitely often, decaying learning rate).
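The distinction shows up as a one-line difference in the TD target. A sketch, where `a_next` would be the action the ε-greedy behavior policy actually selects in `s_next`:

```python
# Q-Learning (off-policy): bootstrap from the best next action,
# independent of what the behavior policy will actually do.
td_target = reward + gamma * np.max(Q[s_next])

# SARSA (on-policy) would instead bootstrap from the action actually taken next:
# td_target = reward + gamma * Q[s_next][a_next]
```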
The agent converges to a stable positive average reward within ~200 episodes. The left panel shows the raw per-episode return (with 50-episode smoothing), while the right panel shows the cumulative average converging to ≈ 0.42 — indicating the agent reliably reaches the goal while avoiding all penalty cells.
The converged policy demonstrates two key behaviors:
- Hole avoidance — the agent routes around all three penalty cells at `(1,3)`, `(2,1)`, and `(3,1)`
- Shortest-path seeking — the policy directs the agent along the most efficient trajectory from `S → T`
The state-value function $V(s) = \max_a Q(s, a)$ summarizes each cell's expected return under the greedy policy and is visualized as a heatmap (`qvalue_heatmap.png`).
The full Q-table after convergence reveals the action-value estimates for every state-action pair. Note how Q-values for actions leading toward holes are strongly negative, while the optimal action in each state has the highest Q-value — exactly as expected.
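With the dict-of-arrays Q-table from the training sketch above, the value heatmap falls out directly (a sketch):

```python
import numpy as np

# State values under the greedy policy: V(s) = max_a Q(s, a), shaped for the heatmap
V = np.array([[Q[(r, c)].max() for c in range(4)] for r in range(4)])
```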
The agent begins with a random policy (Episode 0) and rapidly converges to the optimal policy by Episode 200. The policy remains stable through the remaining 1,800 episodes — evidence of robust convergence.
The number of steps per episode drops from the maximum (20 — hitting the step limit) to the optimal path length of 6 steps, confirming the agent has discovered the shortest route. Occasional spikes correspond to the 1% exploration rate (ε = 0.01) forcing random actions.
With ε = 0.01, the agent exploits its learned policy 99% of the time while maintaining a minimal exploration budget — sufficient to escape local optima while preserving training stability.
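Factored out of the training loop, the selection rule is a single biased coin flip (same sketch assumptions as above):

```python
def select_action(Q, s, epsilon, rng):
    """ε-greedy: explore with probability ε, otherwise exploit the current estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q[s])))  # explore: uniform random action
    return int(np.argmax(Q[s]))              # exploit: greedy w.r.t. Q
```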
```
.
├── Reinforcement_Learning_solving_a_simple_4_4_Gridworld_using_Q_learning.py     # Standalone script
├── Reinforcement_Learning_solving_a_simple_4_4_Gridworld_using_Q_learning.ipynb  # Jupyter notebook
├── generate_visuals.py                                                           # Visualization suite
├── assets/                                                                       # Generated figures
│   ├── gridworld_env.png
│   ├── training_curves.png
│   ├── optimal_policy.png
│   ├── qvalue_heatmap.png
│   ├── qtable_detail.png
│   ├── policy_evolution.png
│   ├── steps_convergence.png
│   └── explore_exploit.png
└── README.md
```
```bash
# Clone the repository
git clone https://github.com/Elktrn/Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Qlearning-in-python.git
cd Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Qlearning-in-python

# Install dependencies
pip install numpy matplotlib

# Run the Q-Learning agent
python Reinforcement_Learning_solving_a_simple_4_4_Gridworld_using_Q_learning.py

# (Optional) Regenerate all figures
python generate_visuals.py
```

| What to change | Where |
|---|---|
| Grid size / layout | `GridWorld.actions` dictionary — add or remove `(row, col)` entries |
| Reward structure | `GridWorld.rewards` dictionary |
| Hyperparameters (α, γ, ε) | Variables `alpha`, `gamma`, and `exploreRate` in the training loop |
| Number of episodes | `range(1, N)` in the training loop |
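For instance, a hypothetical tweak along these lines (attribute names as in the table; the script's exact structure may differ):

```python
env = GridWorld()
env.holes.add((1, 1))       # add a fourth hole at (1, 1) — illustrative only
env.rewards[(1, 1)] = -0.5  # with the matching penalty
alpha = 0.05                # halve the learning rate for smoother convergence
```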
This project implements one of the foundational algorithms in Reinforcement Learning. Q-Learning belongs to the family of model-free, off-policy, temporal-difference methods and was first introduced by Watkins (1989). It is the tabular precursor to Deep Q-Networks (DQN) — the algorithm that achieved human-level performance on Atari games (Mnih et al., 2015).
Note: This is a tabular Q-Learning implementation — not Deep Q-Learning. For the deep learning variant using neural network function approximation, see my other repositories.
| Concept | How it appears |
|---|---|
| Temporal-Difference Learning | Single-step bootstrap updates using the target $r + \gamma \max_{a'} Q(s', a')$ |
| Off-Policy Control | Policy improvement via greedy max over Q-values, independent of the behavior policy |
| ε-Greedy Exploration | 1% random action probability ensures continued state-space coverage |
| Value Function Convergence | Q-table converges to $Q^*$ under standard conditions |
| Policy Extraction | Greedy policy $\pi(s) = \arg\max_a Q(s, a)$ read off the converged table |
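The last row of the table is a one-liner over the converged Q-table (continuing the sketch):

```python
# Greedy policy: π(s) = argmax_a Q(s, a), rendered with action labels
actions = ["U", "D", "L", "R"]
policy = {s: actions[int(np.argmax(q))] for s, q in Q.items()}
```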
Built with NumPy and Matplotlib — no frameworks, no abstractions, just the math.
Sample script output — the converged greedy policy grid:

```
| R | | R | | D | | L |
----------------------------
| R | | R | | D | | L |
----------------------------
| D | | R | | R | | D |
----------------------------
| U | | R | | R |
----------------------------
```
```
exploited:12482 explored:120
```

The counters are consistent with ε = 0.01: 120 of 12,602 action selections (≈ 0.95%) were random.







