Skip to content

psalarc/RL-Dynamic-Programming-GamblersProblem

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 

Repository files navigation

Gambler's Problem — Dynamic Programming with Value & Policy Iteration

Implements Value Iteration and Policy Iteration (from Sutton & Barto, Reinforcement Learning: An Introduction) to solve the Gambler's Problem. Analyzes the effect of discount factor γ and coin-flip probability p_h on the optimal policy.

Stack: Python, Dynamic Programming, MDP, NumPy, Matplotlib


Problem Description

The Gambler's Problem is a classic finite Markov Decision Process (MDP) in which a gambler bets on the outcome of repeated coin flips. The goal is to find the policy that maximizes the probability of reaching a capital of $100 before going broke.

  • States: Gambler's capital s ∈ {1, 2, ..., 99}, with terminal states at s = 0 (loss) and s = 100 (win)
  • Actions: Stakes a ∈ {0, 1, ..., min(s, 100 − s)} — the amount wagered on each flip
  • Reward: +1 when s = 100 is reached; 0 otherwise
  • Dynamics: With probability p_h, the gambler wins the stake; with probability (1 − p_h), they lose it

Approach

Task 1 — Standard Gambler's Problem

Solved for p_h = 0.25 and p_h = 0.55 using both:

  • Value Iteration: Iteratively applies the Bellman optimality equation until the value function converges, then extracts the greedy policy.
  • Policy Iteration: Alternates between full policy evaluation and policy improvement until the policy stabilizes.

The discount factors γ = 1.0, 0.99, and 0.90 are applied to analyze how future reward weighting affects the optimal policy shape.

Task 2 — Modified Rules

An extended variant introduces two modifications:

  • The gambler may stake up to their full current capital (not capped at 100 − s)
  • A "North head" special outcome: with probability 1/8 on a heads flip, winnings are doubled
  • Reward is modified to reflect the final capital reached above 99: reward = final_capital − 99

The state transition and reward functions are updated accordingly, and Value Iteration is re-applied.


Key Results

  • For p_h < 0.5, the optimal policy concentrates stakes at capital levels that maximize the probability of a single decisive win — producing the characteristic "staircase" value function.
  • For p_h > 0.5, the gambler has an inherent edge; spreading smaller bets across more flips increases the probability of winning, flattening the optimal policy.
  • Reducing γ from 1.0 to 0.90 significantly discounts terminal rewards, causing the agent to prefer faster paths to $100 even at greater risk.
  • Policy Iteration and Value Iteration converge to identical optimal policies, confirming implementation correctness.

Repository Structure

RL-Dynamic-Programming-GamblersProblem/
├── notebooks/
│   └── DS669MidtermProject_PabloSalar.ipynb   # Full experiment notebook with policy visualizations
├── src/
│   └── DS669MidtermProject_PabloSalar.py      # Standalone Python script
└── README.md

Technologies

Library Purpose
NumPy Bellman equation computation and array operations
Matplotlib Value function and policy plots across parameters
Jupyter Interactive experimentation and reporting

Reference

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. — Chapter 4: Dynamic Programming.

About

Value Iteration & Policy Iteration on the Gambler's Problem (Sutton & Barto). Analyzes how discount factor γ and coin-flip probability p_h affect optimal policy.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors