- Team Members
- Project Overview
- Environment Analysis
- Implementation
- Project Structure
- Installation
- Usage
- Results
- Models and Architectures
- Contributing
First Name | Last Name | Student ID | Email |
---|---|---|---|
Antonis | Zikas | 1115202100038 | sdi2100038@di.uoa.gr |
Panagiotis | Papapostolou | 1115202100142 | sdi2100142@di.uoa.gr |
This project implements and experiments with various Reinforcement Learning algorithms to train agents on the CartPole-v1 environment from OpenAI Gymnasium. The main focus is on Deep Q-Network (DQN) implementations with different architectural variations and comparative analysis with other RL algorithms.
- Multiple DQN Implementations: Standard DQN, Dueling Architecture, and Transformer-based Q-Networks
- Comprehensive Analysis: Performance comparison with random actions and sensitivity studies
- Modular Design: Clean, well-documented code structure
- Visualization: Detailed plotting and analysis of training results
- Environment Testing: Baseline performance analysis with random actions
- State-of-the-art Algorithms: Integration with Stable-Baselines3 (PPO, A2C)
The CartPole-v1 environment is a classic control problem where the goal is to balance a pole on a cart by moving the cart left or right.
- Action space (Discrete): 2 possible actions
  - `0`: Move cart to the left
  - `1`: Move cart to the right
- Observation space (Continuous): 4-dimensional state vector
  - `[0]`: Cart Position (range: -4.8 to 4.8)
  - `[1]`: Cart Velocity (range: -∞ to +∞)
  - `[2]`: Pole Angle (range: ~-0.418 to 0.418 radians)
  - `[3]`: Pole Angular Velocity (range: -∞ to +∞)
- Reward: +1 for each timestep the pole remains upright
- Reward threshold: 500 (episode considered solved)
- The episode terminates when the pole angle exceeds ±12° or the cart position exceeds ±2.4
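For a quick feel of the environment (and of the random-action baseline discussed below), here is a minimal interaction sketch using Gymnasium's standard API; the seed and episode handling are illustrative, not the project's exact showcase script:

```python
import gymnasium as gym

# Create the CartPole-v1 environment (no rendering, for speed)
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random baseline action (0 or 1)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated      # pole fell, cart out of bounds, or 500-step limit

print(f"Random-action episode return: {total_reward}")
env.close()
```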
Implemented algorithms:

- Deep Q-Network (DQN)
  - Experience replay buffer
  - Target network for stable learning
  - ε-greedy exploration strategy
- Dueling Architecture DQN
  - Separate value and advantage streams
  - Improved learning efficiency
- Transformer-based Q-Network
  - Sequential state processing
  - Attention mechanism for temporal dependencies
- Stable-Baselines3 Integration
  - Proximal Policy Optimization (PPO)
  - Advantage Actor-Critic (A2C)
Shared implementation details of the DQN variants:

- Neural Networks: Fully connected layers with ReLU activations
- Replay Buffer: Experience replay for stable training
- Target Networks: Periodic updates for learning stability
- Exploration Strategy: ε-greedy with exponential decay
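The sketches below illustrate these components; the project's actual implementations live in `src/agents.py`, `src/networks.py`, and `src/replay_buffers.py` and may differ in detail (names such as `ReplayBuffer`, `select_action`, and `sync_target` are used here only for the example):

```python
import random
from collections import deque

import torch


class ReplayBuffer:
    """Fixed-size experience replay: stores (state, action, reward, next_state, done)."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def select_action(q_net, state, epsilon, n_actions=2):
    """ε-greedy exploration: random action with probability ε, otherwise greedy w.r.t. Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())


def sync_target(q_net, target_net):
    """Target network update: copy the online weights every TARGET_UPDATE episodes."""
    target_net.load_state_dict(q_net.state_dict())
```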
Project structure:

```
Reinforcement-Learning-Assignment/
│
├── notebooks/
│   └── cart_pole.ipynb             # Main Jupyter notebook with experiments
│
├── src/
│   ├── agents.py                   # DQN Agent implementation
│   ├── networks.py                 # Neural network architectures
│   ├── trainers.py                 # Training logic and utilities
│   ├── replay_buffers.py           # Experience replay buffer
│   ├── testing.py                  # Model testing and evaluation
│   ├── plotting.py                 # Visualization utilities
│   ├── utils.py                    # Helper functions and hyperparameters
│   ├── dqn.py                      # Main DQN training script
│   ├── env_showcase.py             # Environment demonstration
│   ├── stable_baselines_a2c.py     # A2C training with Stable-Baselines3
│   └── stable_baselines_ppo.py     # PPO training with Stable-Baselines3
│
├── models/                         # Saved trained models
│   ├── dqn_model.pth
│   ├── dueling_arc_dqn_model.pth
│   ├── transformer_model.pth
│   ├── ppo_*.pth
│   └── a2c_*.pth
│
├── reports/
│   ├── figs/                       # Generated plots and visualizations
│   └── PDFs/                       # Final report documents
│
├── logs/
│   └── tensorboard/                # TensorBoard logging for training metrics
│
├── assets/
│   └── imgs/                       # Images and diagrams
│
├── docs/                           # Assignment documentation
├── requirements.txt                # Python dependencies
└── README.md                       # This file
```
Prerequisites:

- Python 3.8+
- CUDA-compatible GPU (optional, for faster training)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd Reinforcement-Learning-Assignment
  ```

- Create a virtual environment (recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
Key dependencies:

- `torch`: Deep learning framework
- `gymnasium`: OpenAI Gym environments
- `stable-baselines3`: State-of-the-art RL algorithms
- `matplotlib`: Plotting and visualization
- `numpy`: Numerical computations
- `jupyter`: Interactive notebooks
- Run Environment Showcase

  ```bash
  python src/env_showcase.py
  ```

- Train DQN Agent

  ```bash
  python src/dqn.py
  ```

- Train with Stable-Baselines3 (see the PPO sketch after this list)

  ```bash
  python src/stable_baselines_ppo.py   # PPO
  python src/stable_baselines_a2c.py   # A2C
  ```

- Interactive Analysis

  ```bash
  jupyter notebook notebooks/cart_pole.ipynb
  ```
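As a rough illustration of what the Stable-Baselines3 scripts do (assuming Stable-Baselines3 ≥ 2.0, which works with Gymnasium; the actual `src/stable_baselines_ppo.py` may use different hyperparameters, paths, and logging):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Train PPO on CartPole-v1; the timestep budget and save path are illustrative
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="logs/tensorboard")
model.learn(total_timesteps=100_000)
model.save("models/ppo_cartpole")  # hypothetical file name

# Quick evaluation rollout with the trained policy
obs, info = env.reset()
done, episode_return = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(int(action))
    episode_return += reward
    done = terminated or truncated
print(f"PPO evaluation return: {episode_return}")
env.close()
```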
Key hyperparameters are defined in `src/utils.py`:
```python
GAMMA = 0.99           # Discount factor
LR = 1e-3              # Learning rate
BATCH_SIZE = 64        # Minibatch size
MEMORY_SIZE = 10000    # Replay buffer size
EPSILON_START = 1.0    # Starting exploration probability
EPSILON_END = 0.01     # Minimum exploration probability
EPSILON_DECAY = 0.995  # Epsilon decay rate
TARGET_UPDATE = 10     # Target network update frequency
```
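The exponential ε-decay implied by these settings can be sketched as follows (the exact schedule lives in the training code and may be applied per step rather than per episode; `epsilon_at` is only an illustrative helper):

```python
EPSILON_START, EPSILON_END, EPSILON_DECAY = 1.0, 0.01, 0.995


def epsilon_at(episode: int) -> float:
    """Exploration probability after `episode` episodes of exponential decay."""
    return max(EPSILON_END, EPSILON_START * EPSILON_DECAY ** episode)


# ε ≈ 1.0 at episode 0, ≈ 0.61 at episode 100, ≈ 0.08 at episode 500,
# and is eventually clipped at EPSILON_END = 0.01
```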
Algorithm | Average Score | Success Rate | Training Episodes |
---|---|---|---|
Random Actions | ~22 | ~10% | N/A |
DQN | ~475+ | ~95%+ | 500 |
Dueling DQN | ~480+ | ~96%+ | 500 |
Transformer DQN | ~450+ | ~90%+ | 500 |
PPO (Stable-Baselines3) | ~500 | ~99% | Variable |
A2C (Stable-Baselines3) | ~495+ | ~98% | Variable |
Key findings:

- All implemented algorithms significantly outperform random actions
- The Dueling architecture shows a slight improvement over standard DQN
- Stable-Baselines3 implementations achieve near-optimal performance
- The Transformer-based approach shows promise but requires tuning
Standard DQN:

Input Layer (4 nodes) → Hidden Layer (128) → Hidden Layer (128) → Hidden Layer (128) → Output Layer (2 nodes)
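A PyTorch sketch of this fully connected Q-network (the authoritative version is in `src/networks.py`; the class name `DQNNetwork` is used here only for illustration):

```python
import torch.nn as nn


class DQNNetwork(nn.Module):
    """4 → 128 → 128 → 128 → 2 fully connected Q-network with ReLU activations."""

    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)
```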
Dueling DQN:

- Shared layers: 4 → 128 → 128 → 128 → 128
- Value stream: 128 → 64 → 1
- Advantage stream: 128 → 64 → 2
- Combination: Q(s,a) = V(s) + (A(s,a) - mean(A(s,·)))
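A sketch of the value/advantage split and combination under the layer sizes listed above (again, `src/networks.py` is the authoritative implementation; `DuelingDQNNetwork` is an illustrative name):

```python
import torch.nn as nn


class DuelingDQNNetwork(nn.Module):
    """Shared trunk (4 → 128 → 128 → 128 → 128), then value and advantage streams."""

    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, state):
        features = self.shared(state)
        v = self.value(features)      # V(s)
        a = self.advantage(features)  # A(s, a)
        # Q(s, a) = V(s) + (A(s, a) - mean(A(s, ·)))
        return v + a - a.mean(dim=-1, keepdim=True)
```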
Transformer Q-Network:

- Sequence length: 10 timesteps
- Embedding dimension: 64
- Attention heads: 4
- Encoder layers: 2
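A sketch of a Transformer encoder over a window of recent states, matching the settings above (sequence length 10, embedding dimension 64, 4 heads, 2 layers); the project's actual architecture in `src/networks.py` may differ, and `TransformerQNetwork` is an illustrative name:

```python
import torch
import torch.nn as nn


class TransformerQNetwork(nn.Module):
    """Q-values computed from the last `seq_len` observations via self-attention."""

    def __init__(self, state_dim=4, n_actions=2, seq_len=10,
                 embed_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(state_dim, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))  # learned positional encoding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(embed_dim, n_actions)

    def forward(self, state_seq):
        # state_seq: (batch, seq_len, state_dim), the most recent observations
        x = self.embed(state_seq) + self.pos
        x = self.encoder(x)
        return self.head(x[:, -1])  # Q-values from the final timestep's representation
```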
The project includes comprehensive visualization tools:
- Training Progress: Score and epsilon decay over episodes
- Performance Comparison: Trained agents vs. random actions
- Sensitivity Analysis: Hyperparameter impact studies
- TensorBoard Integration: Real-time training metrics
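A minimal sketch of the kind of training-progress figure produced by `src/plotting.py` (the function name, styling, and output path here are illustrative):

```python
import matplotlib.pyplot as plt


def plot_training_progress(scores, epsilons, path="reports/figs/training_progress.png"):
    """Plot per-episode scores next to the ε-decay schedule and save the figure."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(scores)
    ax1.set(xlabel="Episode", ylabel="Score", title="Training score")
    ax2.plot(epsilons)
    ax2.set(xlabel="Episode", ylabel="ε", title="Exploration decay")
    fig.tight_layout()
    fig.savefig(path)
```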
This is an academic project for coursework. The implementation follows best practices for:
- Code Organization: Modular, well-documented structure
- Reproducibility: Seed setting for consistent results
- Experimentation: Comprehensive sensitivity studies
- Visualization: Clear, informative plots and metrics
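For reproducibility, a typical seed-setting sketch (the project's actual seeding may be handled elsewhere, e.g. in `src/utils.py`; `set_seed` is an illustrative helper):

```python
import random

import gymnasium as gym
import numpy as np
import torch


def set_seed(seed: int = 42):
    """Seed the Python, NumPy, and PyTorch RNGs; Gymnasium envs are seeded via reset()."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


set_seed(42)
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
```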
This project is part of the coursework for Reinforcement Learning & Stochastic Games
National and Kapodistrian University of Athens
Department of Informatics and Telecommunications