Implements Value Iteration and Policy Iteration (from Sutton & Barto, Reinforcement Learning: An Introduction) to solve the Gambler's Problem. Analyzes the effect of discount factor γ and coin-flip probability p_h on the optimal policy.
Stack: Python, Dynamic Programming, MDP, NumPy, Matplotlib
The Gambler's Problem is a classic finite Markov Decision Process (MDP) in which a gambler bets on the outcome of repeated coin flips. The goal is to find the policy that maximizes the probability of reaching a capital of $100 before going broke.
- States: Gambler's capital s ∈ {1, 2, ..., 99}, with terminal states at s = 0 (loss) and s = 100 (win)
- Actions: Stakes a ∈ {0, 1, ..., min(s, 100 − s)} — the amount wagered on each flip
- Reward: +1 when s = 100 is reached; 0 otherwise
- Dynamics: With probability p_h, the gambler wins the stake; with probability (1 − p_h), they lose it
Solved for p_h = 0.25 and p_h = 0.55 using both:
- Value Iteration: Iteratively applies the Bellman optimality equation until the value function converges, then extracts the greedy policy.
- Policy Iteration: Alternates between full policy evaluation and policy improvement until the policy stabilizes.
The discount factors γ = 1.0, 0.99, and 0.90 are applied to analyze how future reward weighting affects the optimal policy shape.
An extended variant introduces two modifications:
- The gambler may stake up to their full current capital (not capped at 100 − s)
- A "North head" special outcome: with probability 1/8 on a heads flip, winnings are doubled
- Reward is modified to reflect the final capital reached above 99: reward = final_capital − 99
The state transition and reward functions are updated accordingly, and Value Iteration is re-applied.
- For p_h < 0.5, the optimal policy concentrates stakes at capital levels that maximize the probability of a single decisive win — producing the characteristic "staircase" value function.
- For p_h > 0.5, the gambler has an inherent edge; spreading smaller bets across more flips increases the probability of winning, flattening the optimal policy.
- Reducing γ from 1.0 to 0.90 significantly discounts terminal rewards, causing the agent to prefer faster paths to $100 even at greater risk.
- Policy Iteration and Value Iteration converge to identical optimal policies, confirming implementation correctness.
RL-Dynamic-Programming-GamblersProblem/
├── notebooks/
│ └── DS669MidtermProject_PabloSalar.ipynb # Full experiment notebook with policy visualizations
├── src/
│ └── DS669MidtermProject_PabloSalar.py # Standalone Python script
└── README.md
| Library | Purpose |
|---|---|
| NumPy | Bellman equation computation and array operations |
| Matplotlib | Value function and policy plots across parameters |
| Jupyter | Interactive experimentation and reporting |
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. — Chapter 4: Dynamic Programming.