CartPole Environment with Reinforcement Learning


📋 Table of Contents

  • 👥 Team Members
  • 🎯 Project Overview
  • 🎮 Environment Analysis
  • 🚀 Implementation
  • 📁 Project Structure
  • 🔧 Installation
  • 🎮 Usage
  • 📊 Results
  • 🧠 Models and Architectures
  • 📈 Monitoring and Visualization
  • 🤝 Contributing

👥 Team Members

First Name   Last Name      Student ID      Email
Antonis      Zikas          1115202100038   sdi2100038@di.uoa.gr
Panagiotis   Papapostolou   1115202100142   sdi2100142@di.uoa.gr

🎯 Project Overview

This project implements and experiments with various Reinforcement Learning algorithms to train agents on the CartPole-v1 environment from Gymnasium (the maintained successor to OpenAI Gym). The main focus is on Deep Q-Network (DQN) implementations with different architectural variations, together with a comparative analysis against other RL algorithms.

Key Features

  • 🧠 Multiple DQN Implementations: Standard DQN, Dueling Architecture, and Transformer-based Q-Networks
  • 📊 Comprehensive Analysis: Performance comparison with random actions and sensitivity studies
  • 🔧 Modular Design: Clean, well-documented code structure
  • 📈 Visualization: Detailed plotting and analysis of training results
  • 🎮 Environment Testing: Baseline performance analysis with random actions
  • 🏆 State-of-the-art Algorithms: Integration with Stable-Baselines3 (PPO, A2C)

🎮 Environment Analysis

CartPole-v1 Environment

The CartPole-v1 environment is a classic control problem where the goal is to balance a pole on a cart by moving the cart left or right.

Action Space

  • Discrete: 2 possible actions
    • 0: Move cart to the left
    • 1: Move cart to the right

Observation Space

  • Continuous: 4-dimensional state vector
    • [0]: Cart Position (range: -4.8 to 4.8)
    • [1]: Cart Velocity (range: -∞ to +∞)
    • [2]: Pole Angle (range: ~-0.418 to 0.418 radians)
    • [3]: Pole Angular Velocity (range: -∞ to +∞)

Reward System

  • +1 reward for every timestep the pole remains upright
  • Maximum return: 500 (the episode is truncated after 500 timesteps); an average return of 475 is the environment's reward threshold for being considered solved
  • Episode terminates early when the pole angle exceeds ±12° or the cart position exceeds ±2.4
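
To see these spaces and termination rules in action, a minimal random-rollout sketch with Gymnasium might look like the following (illustrative only, separate from the repository's env_showcase.py):

import gymnasium as gym

# Create the environment and inspect its spaces
env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2)
print(env.observation_space)  # Box with 4 dimensions: position, velocity, angle, angular velocity

# Run one episode with random actions
state, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()   # 0 = left, 1 = right
    state, reward, terminated, truncated, info = env.step(action)
    episode_return += reward              # +1 per timestep
    done = terminated or truncated        # pole fell / cart out of bounds, or 500-step cap
print(f"Episode return: {episode_return}")
env.close()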

🚀 Implementation

Core Algorithms Implemented

  1. Deep Q-Network (DQN)

    • Experience replay buffer
    • Target network for stable learning
    • ε-greedy exploration strategy
  2. Dueling Architecture DQN

    • Separate value and advantage streams
    • Improved learning efficiency
  3. Transformer-based Q-Network

    • Sequential state processing
    • Attention mechanism for temporal dependencies
  4. Stable-Baselines3 Integration

    • Proximal Policy Optimization (PPO)
    • Advantage Actor-Critic (A2C)

Key Components

  • Neural Networks: Fully connected layers with ReLU activations
  • Replay Buffer: Experience replay for stable training
  • Target Networks: Periodic updates for learning stability
  • Exploration Strategy: ε-greedy with exponential decay
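
As an illustration of two of these components, a minimal replay buffer and ε-greedy action selection could be sketched as follows (hypothetical names; the repository's versions live in src/replay_buffers.py and src/agents.py and may differ):

import random
from collections import deque

import torch


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a minibatch of past transitions for one training step
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def select_action(q_network, state, epsilon, n_actions=2):
    """ε-greedy: explore with probability ε, otherwise act greedily w.r.t. the Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())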

๐Ÿ“ Project Structure

Reinforcement-Learning-Assignment/
│
├── notebooks/
│   └── cart_pole.ipynb          # Main Jupyter notebook with experiments
│
├── src/
│   ├── agents.py                # DQN Agent implementation
│   ├── networks.py              # Neural network architectures
│   ├── trainers.py              # Training logic and utilities
│   ├── replay_buffers.py        # Experience replay buffer
│   ├── testing.py               # Model testing and evaluation
│   ├── plotting.py              # Visualization utilities
│   ├── utils.py                 # Helper functions and hyperparameters
│   ├── dqn.py                   # Main DQN training script
│   ├── env_showcase.py          # Environment demonstration
│   ├── stable_baselines_a2c.py  # A2C training with Stable-Baselines3
│   └── stable_baselines_ppo.py  # PPO training with Stable-Baselines3
│
├── models/                      # Saved trained models
│   ├── dqn_model.pth
│   ├── dueling_arc_dqn_model.pth
│   ├── transformer_model.pth
│   ├── ppo_*.pth
│   └── a2c_*.pth
│
├── reports/
│   ├── figs/                    # Generated plots and visualizations
│   └── PDFs/                    # Final report documents
│
├── logs/
│   └── tensorboard/             # TensorBoard logging for training metrics
│
├── assets/
│   └── imgs/                    # Images and diagrams
│
├── docs/                        # Assignment documentation
├── requirements.txt             # Python dependencies
└── README.md                    # This file

🔧 Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (optional, for faster training)

Setup Instructions

  1. Clone the repository

    git clone <repository-url>
    cd Reinforcement-Learning-Assignment
  2. Create virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt

Key Dependencies

  • torch: Deep learning framework
  • gymnasium: RL environments (maintained successor to OpenAI Gym)
  • stable-baselines3: State-of-the-art RL algorithms
  • matplotlib: Plotting and visualization
  • numpy: Numerical computations
  • jupyter: Interactive notebooks

🎮 Usage

Quick Start

  1. Run Environment Showcase

    python src/env_showcase.py
  2. Train DQN Agent

    python src/dqn.py
  3. Train with Stable-Baselines3 (see the sketch after this list)

    python src/stable_baselines_ppo.py  # For PPO
    python src/stable_baselines_a2c.py  # For A2C
  4. Interactive Analysis

    jupyter notebook notebooks/cart_pole.ipynb
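
For step 3, a minimal Stable-Baselines3 training run looks roughly like the sketch below (illustrative; the repository's src/stable_baselines_ppo.py may differ in hyperparameters and file names):

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

# Train a PPO agent and write metrics for TensorBoard
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="logs/tensorboard")
model.learn(total_timesteps=100_000)
model.save("models/ppo_cartpole")  # file name is illustrative

# Evaluate the trained policy for one episode
obs, info = env.reset()
done, episode_return = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(int(action))
    episode_return += reward
    done = terminated or truncated
print(f"Episode return: {episode_return}")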

Hyperparameter Configuration

Key hyperparameters are defined in src/utils.py:

GAMMA = 0.99          # Discount factor
LR = 1e-3             # Learning rate
BATCH_SIZE = 64       # Minibatch size
MEMORY_SIZE = 10000   # Replay buffer size
EPSILON_START = 1.0   # Starting exploration probability
EPSILON_END = 0.01    # Minimum exploration probability
EPSILON_DECAY = 0.995 # Epsilon decay rate
TARGET_UPDATE = 10    # Target network update frequency
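
As a small self-contained illustration of how the exploration schedule behaves with these values (assuming, as a sketch, that the decay is applied once per episode):

EPSILON_START, EPSILON_END, EPSILON_DECAY = 1.0, 0.01, 0.995

epsilon, schedule = EPSILON_START, []
for episode in range(500):
    schedule.append(epsilon)
    # Exponential decay, floored at the minimum exploration probability
    epsilon = max(EPSILON_END, epsilon * EPSILON_DECAY)

print(schedule[0], schedule[100], schedule[499])  # 1.0, ~0.61, ~0.08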

📊 Results

Performance Comparison

Algorithm                  Average Score   Success Rate   Training Episodes
Random Actions             ~22             ~10%           N/A
DQN                        ~475+           ~95%+          500
Dueling DQN                ~480+           ~96%+          500
Transformer DQN            ~450+           ~90%+          500
PPO (Stable-Baselines3)    ~500            ~99%           Variable
A2C (Stable-Baselines3)    ~495+           ~98%           Variable

Key Findings

  • ✅ All implemented algorithms significantly outperform random actions
  • ✅ Dueling architecture shows slight improvement over standard DQN
  • ✅ Stable-Baselines3 implementations achieve near-optimal performance
  • ✅ Transformer-based approach shows promise but requires tuning

🧠 Models and Architectures

Standard DQN Architecture

Input Layer (4 nodes) → Hidden Layer (128) → Hidden Layer (128) → Hidden Layer (128) → Output Layer (2 nodes)
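
In PyTorch, this architecture corresponds to a small fully connected module along these lines (a sketch; the repository's version in src/networks.py may differ in details):

import torch.nn as nn


class DQN(nn.Module):
    """Q-network: 4 state inputs -> three hidden layers of 128 units -> 2 Q-values."""

    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, x):
        # Returns Q(s, a) for both actions given a batch of states
        return self.net(x)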

Dueling Architecture

  • Shared layers: 4 → 128 → 128 → 128 → 128
  • Value stream: 128 → 64 → 1
  • Advantage stream: 128 → 64 → 2
  • Combination: Q(s,a) = V(s) + (A(s,a) - mean(A(s,·)))
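
A sketch of how these streams can be combined in a forward pass (hypothetical module, with dimensions taken from the list above):

import torch.nn as nn


class DuelingDQN(nn.Module):
    """Shared trunk followed by separate value and advantage streams."""

    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.value = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        features = self.shared(x)
        v = self.value(features)       # V(s), shape (batch, 1)
        a = self.advantage(features)   # A(s, a), shape (batch, n_actions)
        # Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, ·))
        return v + a - a.mean(dim=1, keepdim=True)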

Transformer Architecture

  • Sequence length: 10 timesteps
  • Embedding dimension: 64
  • Attention heads: 4
  • Encoder layers: 2
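
An illustrative sketch of such a network using PyTorch's built-in encoder (module names are hypothetical; hyperparameters follow the list above):

import torch.nn as nn


class TransformerQNetwork(nn.Module):
    """Encodes a sequence of recent states and predicts Q-values from the last position."""

    def __init__(self, state_dim=4, embed_dim=64, n_heads=4, n_layers=2, n_actions=2):
        super().__init__()
        self.embed = nn.Linear(state_dim, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(embed_dim, n_actions)

    def forward(self, state_seq):
        # state_seq: (batch, seq_len=10, state_dim) - the last 10 observed states
        x = self.embed(state_seq)      # (batch, 10, 64)
        x = self.encoder(x)            # self-attention over the timesteps
        return self.head(x[:, -1, :])  # Q-values from the most recent timestep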

📈 Monitoring and Visualization

The project includes comprehensive visualization tools:

  • Training Progress: Score and epsilon decay over episodes
  • Performance Comparison: Trained agents vs. random actions
  • Sensitivity Analysis: Hyperparameter impact studies
  • TensorBoard Integration: Real-time training metrics
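
Training metrics written to logs/tensorboard can be viewed locally with the standard TensorBoard command:

tensorboard --logdir logs/tensorboard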

๐Ÿค Contributing

This is an academic project for coursework. The implementation follows best practices for:

  • Code Organization: Modular, well-documented structure
  • Reproducibility: Seed setting for consistent results
  • Experimentation: Comprehensive sensitivity studies
  • Visualization: Clear, informative plots and metrics

This project is part of the coursework for Reinforcement Learning & Stochastic Games

National and Kapodistrian University of Athens

Department of Informatics and Telecommunications
