Continuous REINFORCE (TensorFlow) — Mean/Variance Policy Network

A lightweight implementation of REINFORCE (policy gradient) for a continuous action space using TensorFlow / Keras. The policy network outputs an action mean and optionally a learned variance (or std), and the agent updates parameters using the Monte Carlo return.

This repo is intended as a minimal research/educational baseline you can adapt to your own environments.

Features

✅ Continuous-action REINFORCE agent
✅ Policy network with configurable MLP backbone
✅ Optional learned variance (stochastic policy with trainable uncertainty)
✅ Uses TensorFlow Probability (tfd.Normal) for stable log-prob computation
✅ Simple training script + learning curve plotting

Repository Structure (Suggested)

.
├── network_con.py        # ConPolicyGrad policy network (mu + optional var/std)
├── con_reinforce.py      # Agent (REINFORCE) with trajectory memory + update
├── con_main.py           # Training loop (example)
└── utils.py              # plot_learning() helper

If your filenames differ, update the import paths in con_main.py.

Installation

Requirements

Python 3.9+
TensorFlow 2.x
TensorFlow Probability
NumPy
Matplotlib (for plots)

Install:

pip install tensorflow tensorflow-probability numpy matplotlib

Quick Start

Run training:

python con_main.py

This will generate plots:

score.png — running-average episode reward
mu.png — running-average mean estimate
sigma.png — running-average std/variance behavior (if enabled)
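
The "running-average" curves above smooth noisy per-episode values. As a rough sketch of what such smoothing does (the function name and the window size of 100 are assumptions, not taken from utils.py), a trailing mean can be computed like this:

```python
def running_average(values, window=100):
    """Mean of the last `window` values at each index (fewer at the start)."""
    averaged = []
    for i in range(len(values)):
        # Trailing slice ending at index i; shorter than `window` early on.
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


print(running_average([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

A larger window gives a smoother curve but reacts more slowly to changes in the policy.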

How It Works

Policy Network (`ConPolicyGrad`)

The policy is a simple MLP:

Two hidden layers (ReLU)
Output head for mean mu
Optional output head for variance/std (positive via Softplus)

When learn_var=True, the network returns:

mu, var

Otherwise it returns:

mu
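
The architecture described above could look roughly like the following Keras model. This is a minimal sketch, not the repo's ConPolicyGrad class: the class name PolicySketch, the default layer sizes, and the action_dim argument are assumptions made for illustration.

```python
import tensorflow as tf


class PolicySketch(tf.keras.Model):
    """Hypothetical stand-in for ConPolicyGrad (names and defaults assumed)."""

    def __init__(self, action_dim=1, fc1_dims=64, fc2_dims=64, learn_var=True):
        super().__init__()
        self.learn_var = learn_var
        # Two hidden layers with ReLU activations.
        self.fc1 = tf.keras.layers.Dense(fc1_dims, activation="relu")
        self.fc2 = tf.keras.layers.Dense(fc2_dims, activation="relu")
        # Unbounded mean head.
        self.mu_head = tf.keras.layers.Dense(action_dim)
        if learn_var:
            # Softplus keeps the variance output strictly positive.
            self.var_head = tf.keras.layers.Dense(action_dim, activation="softplus")

    def call(self, state):
        x = self.fc2(self.fc1(state))
        mu = self.mu_head(x)
        if self.learn_var:
            return mu, self.var_head(x)
        return mu


net = PolicySketch(action_dim=1, learn_var=True)
mu, var = net(tf.zeros((4, 3)))  # batch of 4 states, 3 features each
print(mu.shape, var.shape)
```

Because the return type changes with learn_var, calling code has to branch on the same flag when unpacking the output.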

Agent (`Agent`)

The agent stores an episode trajectory:

states
actions
rewards

Then computes Monte Carlo returns:

$$ G_t = \sum_{k=t}^{T-1}\gamma^{k-t} r_k $$
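
The sum above satisfies the recursion G_t = r_t + gamma * G_{t+1}, so all returns can be computed in one backward pass over the episode. A small sketch in plain Python (the function name is illustrative and may not match the helper used in con_reinforce.py):

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0  # value of G_{t+1}; zero past the end of the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each entry only includes rewards from its own time step onward, which is the property the pitfalls section warns about.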

And applies REINFORCE:

$$ \mathcal{L}(\theta) = -\mathbb{E}\left[G_t \log \pi_\theta(a_t|s_t)\right] $$

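The log-probability is computed with TensorFlow Probability. The scale argument of tfd.Normal is a standard deviation, so if the network outputs a variance, take its square root first: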

tfd.Normal(loc=mu, scale=std).log_prob(action)
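
To check values by hand, the log-density of a univariate normal distribution has a simple closed form. The sketch below evaluates it in plain Python and forms a sample estimate of the loss for one episode. It carries no gradients, so it is only a numerical cross-check, and both function names are assumptions rather than part of the repo.

```python
import math


def normal_log_prob(action, mu, std):
    """Log density of a univariate Normal(mu, std) evaluated at `action`."""
    z = (action - mu) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def reinforce_loss(actions, mus, stds, returns):
    """Sample estimate of -E[G_t * log pi(a_t | s_t)] over one episode."""
    terms = [
        g * normal_log_prob(a, m, s)
        for a, m, s, g in zip(actions, mus, stds, returns)
    ]
    return -sum(terms) / len(terms)


# A standard normal evaluated at its mean: -0.5 * log(2 * pi).
print(normal_log_prob(0.0, 0.0, 1.0))
print(reinforce_loss([0.0], [0.0], [1.0], [2.0]))
```

Actions with a positive return get their log-probability pushed up when this loss is minimized, and actions with a negative return get it pushed down.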

Configuration

You can customize training via:

learn_var: learn policy variance (True/False)
fixed_std: use constant std when learn_var=False
gamma: discount factor
alpha: learning rate
layer sizes (fc1_dims, fc2_dims, etc.)

Example:

agent = Agent(
    alpha=3e-3,      # learning rate
    gamma=0.99,      # discount factor
    learn_var=True,  # learn the policy variance
    fixed_std=0.1,   # only used when learn_var=False
)

Notes / Common Pitfalls

Returns computation: Make sure returns are computed from time t onward (not always from 0).
Shape consistency: Use consistent shapes for states/actions, ideally (batch, dim) for network inputs.
Variance stability: Enforce a minimum variance/std to avoid numerical issues (e.g., min_std=1e-3).
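
One way to enforce the minimum mentioned in the last point is to clamp the Softplus output from below. The following is a sketch in plain Python: the helper names are hypothetical, and it assumes the raw network output is mapped to a standard deviation (if the head models a variance instead, apply the floor to the variance or to its square root).

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def safe_std(raw, min_std=1e-3):
    """Positive std with a floor, so log(std) and 1/std stay finite."""
    return max(softplus(raw), min_std)


print(safe_std(0.0))    # softplus(0) = log(2), well above the floor
print(safe_std(-50.0))  # softplus is nearly 0 here, so the floor applies
```

Without the floor, a very negative raw output drives the std toward zero and the log-probability of any action away from the mean toward negative infinity.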

Extending to Real Environments

This repo currently uses a toy reward function. To integrate with a Gym or custom environment:

Replace reward() with env.step(action)
Store transitions per step
Call learn() after each episode

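Pseudo-code (classic Gym API, where reset() returns the observation and step() returns a 4-tuple; Gymnasium and Gym 0.26+ instead return (obs, info) from reset() and a 5-tuple with separate terminated and truncated flags from step()):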

state = env.reset()
done = False
while not done:
    action = agent.choose_action(state)
    next_state, reward, done, _ = env.step(action)
    agent.store_transition(state, action, reward)
    state = next_state
agent.learn()
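
To dry-run the loop above without installing Gym, it can be exercised against stand-ins. Both classes below are hypothetical: ToyEnv only mimics the classic reset()/step() signature, and RandomAgent exposes the three methods the loop calls without doing any learning.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment with the classic Gym reset()/step() shape."""

    def __init__(self, target=0.5, horizon=10):
        self.target = target
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        reward = -(action - self.target) ** 2  # highest when action == target
        done = self.t >= self.horizon          # fixed-length episodes
        return [float(self.t)], reward, done, {}


class RandomAgent:
    """Placeholder with the same method names as the loop; it does not learn."""

    def __init__(self):
        self.memory = []

    def choose_action(self, state):
        return random.gauss(0.0, 1.0)

    def store_transition(self, state, action, reward):
        self.memory.append((state, action, reward))

    def learn(self):
        n = len(self.memory)
        self.memory.clear()  # trajectory is discarded after each update
        return n             # number of transitions consumed


env, agent = ToyEnv(), RandomAgent()
state = env.reset()
done = False
while not done:
    action = agent.choose_action(state)
    next_state, reward, done, _ = env.step(action)
    agent.store_transition(state, action, reward)
    state = next_state
consumed = agent.learn()
print(consumed)  # one transition per step of the episode
```

Swapping RandomAgent for the real Agent and ToyEnv for an actual environment should leave the loop body unchanged, provided the environment follows the same API.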

License

MIT License (recommended for reuse).

Contact

For questions, suggestions, or collaboration inquiries, please open a GitHub issue.

For direct communication, please email: [email protected]
