Aktuelle Data Science Entwicklungen II - Tetris Reinforcement Learning

Project Overview

This project implements and evaluates various Reinforcement Learning (RL) and Evolutionary Algorithm (EA) agents designed to play the classic game of Tetris. The primary goal is to explore different algorithmic approaches to solving this iconic game, comparing their learning efficiency, performance, and emergent strategies.

This project was developed as part of the Aktuelle Data Science Entwicklungen II - Reinforcement Learning course at DHBW Mannheim.

Contributors

Name          | Student ID
German Paul   | 9973152
Michael Greif | 5658606

Getting Started

Prerequisites

  • Python 3.13
  • uv (Python package installer and virtual environment manager): https://github.com/astral-sh/uv
  • A C++ compiler (required by PyTorch for certain installations)
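
If the environment has not been set up yet, the dependencies declared in pyproject.toml / uv.lock can be installed explicitly with uv first (optional; current uv versions should also sync the environment automatically when running uv run):

uv sync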

Running the Project

The main interface for interacting with the agents (training, testing, evaluating) is main.py:

uv run main.py

This will launch an interactive command-line menu allowing you to:

  • Train specific agents (DQN, REINFORCE, A2C, PPO, Genetic Algorithm, Evolutionary Strategies).
  • Test a pre-trained agent and generate a video of its gameplay.
  • Evaluate all trained agents and generate a performance summary.

The project also includes dedicated training scripts for different categories of agents:

  • train_dqn_reinforce.py: For DQN and REINFORCE agents.
  • train_onpolicy.py: For A2C and PPO agents.
  • train_evolutionary.py: For Genetic Algorithm and Evolutionary Strategies.

These can be run directly if preferred, e.g.:

uv run train_dqn_reinforce.py --agent_type dqn
uv run train_evolutionary.py --agent_type genetic

Use the --help flag for more options on these scripts (e.g., uv run train_dqn_reinforce.py --help).

Project Structure

The project is organized as follows (directory tree shown up to three levels deep):

.
├── .git/                     # Git internal files (hidden)
├── .venv/                    # Virtual environment (if created here)
├── agents/
│   ├── __init__.py           # Makes 'agents' a package and defines AGENT_REGISTRY
│   ├── a2c_agent.py          # A2C agent implementation
│   ├── base_agent.py         # Abstract base class for agents
│   ├── dqn_agent.py          # DQN agent implementation
│   ├── es_agent.py           # Evolutionary Strategies agent implementation
│   ├── genetic_agent.py      # Genetic Algorithm agent implementation
│   ├── ppo_agent.py          # PPO agent 
│   ├── random_agent.py       # Random agent
│   └── reinforce_agent.py    # REINFORCE agent implementation
├── models/                   # Directory for saved agent models
│   └── gifs/                 # Subdirectory for videos generated by test.py
├── src/
│   └── tetris.py             # Core Tetris game logic
├── .gitignore
├── .python-version           # pyenv version file (if used)
├── config.py                 # Global configuration for hyperparameters, paths, etc.
├── evaluate.py               # Script to evaluate all trained agents
├── LICENSE
├── main.py                   # Main interactive launcher script
├── pyproject.toml            # Project metadata and dependencies for uv/Poetry/PEP517
├── README.md                 # This file
├── requirements.txt          # Project dependencies (for pip/uv)
├── test.py                   # Script to test a single trained agent and record video
├── train_dqn_reinforce.py    # Training script for DQN and REINFORCE
├── train_evolutionary.py     # Training script for GA and ES
├── train_onpolicy.py         # Training script for A2C and PPO
└── uv.lock                   # uv lock file

Key Components:

  • main.py: The main entry point for interacting with the project via a CLI menu.
  • agents/: Contains all the agent implementations.
    • base_agent.py: Defines an abstract BaseAgent class that all other agents inherit from, ensuring a consistent interface for select_action, learn, save, load, etc. (a minimal interface sketch follows this list).
    • Individual agent files (dqn_agent.py, etc.): Implement the specific logic for each algorithm.
  • src/tetris.py: The core Tetris game environment. It handles game rules, piece movements, line clearing, and scoring. It also provides a method get_next_states() which is crucial for how the agents make decisions by evaluating potential outcomes of piece placements.
  • config.py: Centralized configuration for hyperparameters, model paths, training game counts, etc., for all agents.
  • Training Scripts (train_*.py): Dedicated scripts for training different categories of agents. They handle the main training loop, calling the agent's select_action and learn methods, and managing game episodes.
  • evaluate.py: Loads the best saved models for each agent type and runs a set number of games to gather performance statistics, outputting a summary table and a CSV file.
  • test.py: Loads a specific trained agent and records a video of its gameplay for a few games, focusing on the best-scoring game.
  • models/: Default directory where trained models are saved (usually with the score in the filename) and where test videos are stored.
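
As a rough, illustrative sketch of the BaseAgent interface described above (method names beyond select_action, learn, save, and load, and all signatures, are assumptions rather than the repository's exact code):

from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Illustrative interface that the concrete agents would implement."""

    @abstractmethod
    def select_action(self, next_states):
        """Choose one of the candidate placements.

        next_states is assumed to map each action (e.g. a column/rotation
        pair) to the feature vector of the board after that placement;
        the chosen action key is returned."""

    @abstractmethod
    def learn(self, *args, **kwargs):
        """Update the agent's parameters from collected experience."""

    @abstractmethod
    def save(self, path):
        """Persist model weights/state to disk."""

    @abstractmethod
    def load(self, path):
        """Restore model weights/state from disk."""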

Agent Explanations

This project implements several types of RL and EA agents to play Tetris. The Tetris environment is unique in that the "action" is to choose one of many possible placements for the current falling piece. The agents evaluate the board state after a hypothetical placement to make their decisions.
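
To illustrate this decision loop, here is a minimal sketch of scoring the candidate placements returned by get_next_states() and picking the best one (the dictionary layout, feature format, and value_net are assumptions for illustration, not the exact interface of src/tetris.py):

import torch

def greedy_placement(next_states, value_net):
    """next_states: assumed dict mapping each action to a list of board
    features (e.g. lines cleared, holes, bumpiness, height) describing
    the board after that hypothetical placement."""
    actions = list(next_states.keys())
    features = torch.tensor([next_states[a] for a in actions],
                            dtype=torch.float32)
    with torch.no_grad():
        scores = value_net(features).squeeze(-1)  # one score per placement
    return actions[int(torch.argmax(scores))]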

1. DQN (Deep Q-Network) Agent (agents/dqn_agent.py)

  • Type: Value-based, Off-policy Reinforcement Learning.
  • Network: Uses a Q-Network (a Multi-Layer Perceptron (MLP) in this project, since it takes handcrafted board features as input) to estimate the Q-value (expected future reward) of a given board state after a piece has been placed.
  • Action Selection: During training, it uses epsilon-greedy exploration: with probability epsilon, it chooses a random placement; otherwise, it queries the Q-network for all possible next states (resulting from valid piece placements) and chooses the placement leading to the state with the highest Q-value. During testing, epsilon is set to 0 (greedy selection).
  • Learning:
    • Uses a Replay Buffer to store experiences: (S_t_features, S'_chosen_features, R_{t+1}, S_{t+1}_features, Done_{t+1}).
      • S_t_features: Board features before placing the current piece.
      • S'_chosen_features: Board features after the chosen piece placement. This is the input to the Q-network to get the "current Q-value".
      • R_{t+1}: Reward from the placement.
      • S_{t+1}_features: Board features when the next piece appears. This is used to calculate the target Q-value.
      • Done_{t+1}: If the game ended after S_{t+1}.
    • A separate Target Network is used to stabilize learning. The target Q-value is calculated as R_{t+1} + gamma * max_a' Q_target(features_of_state_after_a'_from_S_{t+1}).
    • The loss is typically Mean Squared Error (MSE) between the current Q-value and the target Q-value.
  • Training Script: train_dqn_reinforce.py (shared with REINFORCE, but their learning loops are distinct).
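
To make the update above concrete, here is a minimal, simplified sketch of one learning step on a sampled mini-batch (all names and shapes are assumptions; for brevity it bootstraps from a single stored next-state value rather than an explicit max over the placements available from S_{t+1}, cf. the note under Future Work):

import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a mini-batch sampled from the replay buffer.

    batch: tensors (chosen_features, rewards, next_features, dones), where
    chosen_features describe the board after the chosen placement."""
    chosen_features, rewards, next_features, dones = batch

    # Q-value of the afterstate that was actually chosen.
    q_current = q_net(chosen_features).squeeze(-1)

    # Bootstrapped target from the frozen target network.
    with torch.no_grad():
        q_next = target_net(next_features).squeeze(-1)
        q_target = rewards + gamma * (1.0 - dones) * q_next

    loss = nn.functional.mse_loss(q_current, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()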

2. REINFORCE Agent (agents/reinforce_agent.py)

  • Type: Policy Gradient, On-policy Reinforcement Learning.
  • Network: Uses a Policy Network (MLP) that takes features of a board state after a piece has been placed and outputs a single score (logit).
  • Action Selection: For all possible next states S' (resulting from valid piece placements), the policy network computes their scores. A softmax is applied over these scores to create a probability distribution. An action (piece placement) is then sampled from this distribution.
  • Learning:
    • Collects trajectories of (log_prob_of_chosen_S', reward) for an entire episode (one full game of Tetris).
    • At the end of the episode, it calculates the discounted future returns (G_t) for each step.
    • The policy network is updated to increase the probability of actions that led to higher returns. The loss is typically - sum(log_prob_chosen_S' * G_t).
  • Training Script: train_dqn_reinforce.py.
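
A minimal sketch of the end-of-episode update described above (variable names and the return normalization are illustrative assumptions):

import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """log_probs: list of log-probability tensors for the chosen
    placements; rewards: list of per-step rewards for one full game."""
    # Discounted returns G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Common variance-reduction trick: normalize the returns.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -sum(log_prob_chosen_S' * G_t).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()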

3. A2C (Advantage Actor-Critic) Agent (agents/a2c_agent.py)

  • Type: Policy Gradient & Value-based, On-policy Reinforcement Learning.
  • Networks: Uses two networks (often with shared initial layers):
    • Actor: Takes features of a board state after a piece placement (S') and outputs a score/logit. Similar to REINFORCE, a softmax over scores for all possible S' gives action probabilities.
    • Critic: Takes features of the board state before a piece placement (S_t) and outputs an estimate of the value of that state, V(S_t).
  • Action Selection: The Actor network determines the probability distribution over possible piece placements (derived from scores of S' states), and an action is sampled.
  • Learning:
    • Learns at each step (after each piece placement).
    • Critic Update: The critic learns to better predict V(S_t) using the TD target: R_{t+1} + gamma * V(S_{t+1}). The loss is MSE between V(S_t) and this target.
    • Actor Update: The actor is updated using the advantage: A_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t). The loss encourages actions that lead to a positive advantage, typically -log_prob_chosen_S' * A_t. An entropy bonus can be added to encourage exploration.
  • Training Script: train_onpolicy.py (shared with PPO).
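
A minimal sketch of the per-step update described above (advantage, critic TD loss, actor loss with an optional entropy bonus); the function signature and the scalar tensors are illustrative assumptions:

import torch.nn.functional as F

def a2c_step(value_s, value_next, reward, done, log_prob, entropy,
             optimizer, gamma=0.99, entropy_coef=0.01):
    """value_s = V(S_t) and value_next = V(S_{t+1}) from the critic;
    log_prob and entropy come from the actor's distribution over the
    possible placements."""
    # TD target and advantage (no bootstrapping on terminal states).
    td_target = reward + gamma * (1.0 - done) * value_next.detach()
    advantage = td_target - value_s

    critic_loss = F.mse_loss(value_s, td_target)
    actor_loss = -log_prob * advantage.detach()
    loss = actor_loss + critic_loss - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()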

4. PPO (Proximal Policy Optimization) Agent (agents/ppo_agent.py)

  • Type: Policy Gradient, On-policy Reinforcement Learning. It's an improvement over A2C, often more stable and sample-efficient.
  • Networks: Similar to A2C, it uses an Actor and a Critic network.
    • Actor: Evaluates features of S' (state after potential action).
    • Critic: Evaluates features of S_t (state before action).
  • Action Selection: The Actor network defines a policy (distribution over S' scores), and an action is sampled.
  • Learning:
    • Collects a batch of experiences (a "horizon" of piece placements).
    • Calculates advantages using Generalized Advantage Estimation (GAE), which helps balance bias and variance.
    • Performs multiple epochs of updates on the collected batch of data.
    • The key PPO innovation is the "clipped surrogate objective function," which restricts how much the policy can change in one update, leading to more stable training. It compares the ratio of new policy probabilities to old policy probabilities.
  • Training Script: train_onpolicy.py.
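
Two minimal sketches of the pieces described above: Generalized Advantage Estimation over a collected horizon, and the clipped surrogate objective (the multi-epoch update loop and the value/entropy terms are omitted; all names, including clip_eps and lam, are illustrative assumptions):

import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one collected horizon; values must hold one extra
    bootstrap entry V(S_T) at the end."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.append(gae)
    return list(reversed(advantages))

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective over a batch of collected placements."""
    # Probability ratio between the current policy and the policy that
    # collected the data (old_log_probs should be detached).
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) term; negate because we minimize.
    return -torch.min(unclipped, clipped).mean()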

5. Genetic Algorithm (GA) Agent (agents/genetic_agent.py)

  • Type: Evolutionary Algorithm.
  • Individuals: Each individual in the population is a PolicyNetwork (MLP that scores features of S'). Its "genes" are the network weights.
  • Fitness: The fitness of an individual is its average score obtained over one or more full games of Tetris.
  • Evolution Process (GeneticAlgorithmController):
    • Selection: Parents are chosen from the current population, typically using tournament selection (fitter individuals are more likely to be chosen).
    • Crossover: Genetic material (network weights) from two parents is combined to create offspring (e.g., by averaging weights or single-point crossover).
    • Mutation: Small random changes are introduced into the offspring's weights to maintain diversity.
    • Elitism: A few of the best individuals from the current generation are carried over directly to the next generation, unchanged.
    • This process is repeated for a set number of generations.
  • Usage for Evaluation/Testing: The GeneticAgent class acts as a wrapper that loads the best PolicyNetwork found by the GeneticAlgorithmController during training and uses it to select actions.
  • Training Script: train_evolutionary.py.
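
To illustrate one generation of the evolution loop above, here is a minimal sketch that operates on flat weight vectors, with tournament selection, averaging crossover, Gaussian mutation, and elitism (everything here, including the representation of individuals as NumPy arrays, is an illustrative assumption):

import numpy as np

def next_generation(population, fitnesses, rng, elite=2,
                    tournament_size=3, mutation_std=0.05):
    """population: list of flat weight vectors (np.ndarray);
    fitnesses: average game score of each individual;
    rng: an np.random.default_rng() instance."""
    order = np.argsort(fitnesses)[::-1]
    # Elitism: carry the best individuals over unchanged.
    new_pop = [population[i].copy() for i in order[:elite]]

    def tournament():
        idx = rng.choice(len(population), size=tournament_size, replace=False)
        return population[max(idx, key=lambda i: fitnesses[i])]

    while len(new_pop) < len(population):
        parent1, parent2 = tournament(), tournament()
        child = 0.5 * (parent1 + parent2)                     # crossover
        child += rng.normal(0.0, mutation_std, child.shape)   # mutation
        new_pop.append(child)
    return new_pop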

6. ES (Evolutionary Strategies) Agent (agents/es_agent.py)

  • Type: Evolutionary Algorithm (a type of black-box optimization).
  • Core Idea: ES directly evolves the parameters (weights) of a single "central" policy network.
  • Network: Uses a PolicyNetworkES (MLP that scores features of S').
  • Evolution Process:
    1. Start with a central set of weights for the policy network.
    2. In each generation, create a population of "perturbed" versions of these central weights by adding Gaussian noise.
    3. Evaluate the fitness (game score) of each set of perturbed weights by running Tetris games.
    4. Update the central weights by moving them in the direction of the "gradient" estimated from the fitness scores of the perturbed individuals. Essentially, weights that led to higher scores get "reinforced." The update is a weighted sum of the noise vectors, where weights are derived from the fitness scores.
    5. The learning rate and noise standard deviation (sigma) are key hyperparameters.
  • Usage for Evaluation/Testing: The ESAgent class uses its current best central policy network (updated through evolution) for action selection.
  • Training Script: train_evolutionary.py.
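
A minimal sketch of one ES generation as described in steps 1-5 above (fitness_fn, the plain score standardization, and the hyperparameter values are illustrative assumptions; mirrored/antithetic sampling is omitted):

import numpy as np

def es_generation(theta, fitness_fn, rng, pop_size=50, sigma=0.1, lr=0.02):
    """theta: flat central weight vector; fitness_fn maps a weight vector
    to a game score; rng: an np.random.default_rng() instance."""
    # 1-2. Sample Gaussian perturbations around the central weights.
    noise = rng.normal(0.0, 1.0, size=(pop_size, theta.size))

    # 3. Evaluate each perturbed policy by playing Tetris games.
    fitness = np.array([fitness_fn(theta + sigma * n) for n in noise])

    # 4. Standardize the scores and move theta along the estimated gradient:
    #    a fitness-weighted sum of the noise vectors.
    weights = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad = noise.T @ weights / (pop_size * sigma)
    return theta + lr * grad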

7. Random Agent (agents/random_agent.py)

  • Type: Baseline / Control.
  • Action Selection: Chooses a random valid piece placement from the available next states.
  • Learning: Does not learn.
  • Purpose: Provides a baseline performance to compare other agents against.

Qualitative Evaluation (Gameplay Videos/GIFs)

To get a visual sense of how our agents perform, we've recorded gameplay GIFs. These clips highlight the strategies (or lack thereof) learned by each agent.

  • Random Agent Gameplay:

    • As expected, the random agent makes moves without any apparent strategy, quickly leading to a game over with a low score.
    • (GIF: Random Agent Gameplay)
  • DQN Agent Gameplay:

    • Our DQN agent demonstrates significantly improved play. It's making more strategic placements, clearing lines effectively, and achieving high scores, reflecting its strong quantitative performance.
    • (GIF: DQN Agent Gameplay)
  • Genetic Algorithm (Best Individual) Gameplay:

    • The Genetic Algorithm showcases exceptional gameplay. The piece placements are highly optimized, leading to very efficient line clearing and the highest scores we've observed, consistent with its top ranking in our quantitative tests.
    • (GIF: Genetic Algorithm Gameplay)
  • REINFORCE Agent Gameplay:

    • The REINFORCE agent shows signs of learning, making some reasonable moves and outperforming the random agent. However, its strategy is not as refined as our top performers, aligning with its mid-tier quantitative results.
    • (GIF: REINFORCE Agent Gameplay)
  • A2C Agent Gameplay:

    • The improved A2C agent now demonstrates competent and strategic gameplay. It successfully clears lines and builds stacks with clear intent, achieving respectable scores. Its performance is much more in line with the PPO agent, showcasing a significant improvement in its learned policy and an ability to sustain longer games.
    • (GIF: A2C Agent Gameplay)
  • PPO Agent Gameplay:

    • Our PPO agent exhibits a noticeable improvement over random play, attempting to make sensible placements. While it achieves respectable scores, its strategy isn't as sophisticated or high-scoring as our top-tier agents like GENETIC or DQN.
    • (GIF: PPO Agent Gameplay)
  • Evolutionary Strategies (Best Policy) Gameplay:

    • The ES agent displays very strong and intelligent gameplay. It consistently makes strategic decisions, leading to high scores and effective line clearing, confirming its place as one of our top-performing agents.
    • (GIF: ES Agent Gameplay)

Quantitative Evaluation (Performance Metrics)

We evaluated our agents over 20 games after their respective training routines. The following table summarizes their performance.

Quantitative Evaluation Results:

Agent     | Avg Score  | Std Score  | Min Score | Max Score | Avg Pieces | Avg Tetrominoes | Avg Lines Cleared | Num Eval Games
----------|------------|------------|-----------|-----------|------------|-----------------|-------------------|---------------
RANDOM    | 18.05      | 3.34       | 13        | 25        | 18.75      | 18.75           | 0.00              | 20
DQN       | 1686055.75 | 1918342.52 | 92544     | 8728554   | 256904.55  | 256904.55       | 102748.95         | 20
GENETIC   | 2123031.05 | 1813144.83 | 145855    | 5885656   | 344067.05  | 344067.05       | 137613.90         | 20
REINFORCE | 22707.15   | 17638.53   | 1588      | 51922     | 3644.65    | 3644.65         | 1444.55           | 20
A2C       | 23580.20   | 16033.70   | 3695      | 58220     | 2453.10    | 2453.10         | 967.55            | 20
PPO       | 34448.35   | 31026.29   | 790       | 125177    | 3361.35    | 3361.35         | 1332.50           | 20
ES        | 1256402.85 | 1251838.17 | 39100     | 4977175   | 202922.95  | 202922.95       | 81156.30          | 20

Analysis of Our New Results:

  • Our Top Performers:
    • The Genetic Algorithm (GENETIC) emerged as our leading agent by a significant margin. It achieved the highest average score, maximum score, and lines cleared, which tells us its direct policy optimization approach is working exceptionally well with our current setup.
    • Our Evolutionary Strategies (ES) agent also performed extremely well, securing the second-highest average score and showing strong learning.
    • We're pleased to see our Deep Q-Network (DQN) has shown a massive improvement from our earlier evaluations. It's now achieving very high scores, comparable to ES. This strongly suggests that the fixes we implemented for model loading and ensuring proper evaluation mode for DQN were successful.
  • Mid-Tier Performers (Showing Encouraging Learning):
    • Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C) and REINFORCE are now demonstrating significant learning, with average scores in the tens of thousands. They are substantially outperforming our random agent baseline, though they still have some way to go to catch up with GENETIC, ES, and DQN.
  • Variability in Performance: We observed that many of our successful agents (GENETIC, ES, DQN) show high standard deviations in their scores. This indicates a wide range of performance across the evaluation games, which isn't entirely unexpected for a complex game like Tetris.
  • Impact of Model Loading & Evaluation Mode:
    • The dramatic improvement in both DQN's and A2C's performance after our review highlights the importance of applying rigorous model loading and evaluation mode procedures consistently across all agents. This was a key factor in achieving reliable and representative results.
  • Effectiveness of Different Approaches:
    • It seems direct policy search methods (GENETIC, ES) and value-based methods when properly evaluated (DQN) are proving very effective for us.
    • Our policy gradient methods (REINFORCE, A2C, PPO) are learning, but a clear gap to the top tier remains. This could be due to their known sensitivity to hyperparameters, our choice of baseline, the current feature set, or the network architecture we've used.

Our Next Steps & Recommendations:

  1. Optimize Policy Gradient Agents: While PPO, A2C, and REINFORCE are learning effectively, there's a clear performance gap to our top-tier agents. Our priority is to bridge this gap through further hyperparameter tuning (learning rates, entropy coefficients, GAE parameters) and potentially exploring more complex network architectures for these agents.
  2. Ensure Consistency in Evaluation: We will continue to ensure all agents are evaluated under identical conditions and that our evaluation scripts robustly load the best available model for each agent, using our score-in-filename convention.
  3. Feature Engineering & Network Architecture (Longer-Term): For agents that aren't yet at the top tier, we might revisit our state representation (the current 4 heuristic features) and the complexity of our neural network architectures in the future. However, it's encouraging that GENETIC, ES, and DQN have demonstrated that high scores are achievable with the current setup.

Future Work and Potential Improvements

  • Improve RL Agent Learning:
    • Correct Target Calculation for DQN: Implement the more accurate target Q-value calculation for DQN that involves simulating next possible moves from S_{t+1} using the target network. This requires careful handling of game state reconstruction or storing more information in the replay buffer.
    • Refine A2C/PPO Network Inputs: Ensure the Actor network for A2C/PPO correctly models pi(Action | S_t) by taking S_t features (before piece placement) and outputting a distribution over the discrete placements derived from get_next_states(). The Critic should take S_t features and output V(S_t). This is a significant architectural change from the current model of scoring S' states.
    • Hyperparameter Tuning: Extensive tuning of learning rates, discount factors, network architectures, exploration schedules (for DQN), entropy coefficients (for A2C/PPO), etc.
    • Reward Shaping: Experiment with different reward functions beyond just lines cleared or game score to guide learning more effectively (e.g., penalties for holes, height); a small illustrative sketch follows this list.
  • Advanced State Features: Explore more sophisticated handcrafted features or consider using a raw pixel input with a Convolutional Neural Network (CNN) for the RL agents, which might capture more nuances of the board state.
  • Enhanced Evolutionary Algorithms:
    • More sophisticated crossover and mutation operators for GA.
    • Adaptive sigma/learning rates for ES.
  • Comparative Analysis: More rigorous comparison of learning curves, final performance, and sample efficiency across all agents.
  • Code Refinements: Move shared helper functions (like model saving/loading utilities) into a common utils.py file.
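
As a purely illustrative example of the reward-shaping idea mentioned above (not the reward actually used by src/tetris.py; all coefficients are made up and would need tuning):

def shaped_reward(lines_cleared, holes, aggregate_height,
                  hole_penalty=0.5, height_penalty=0.05):
    """Hypothetical shaped reward: bonus for line clears, penalties for
    holes and overall stack height."""
    return (1.0 + 10.0 * lines_cleared ** 2
            - hole_penalty * holes
            - height_penalty * aggregate_height)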

Sources and Inspirations

Foundational Papers & Algorithms

  • Deep Q-Network (DQN): Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  • REINFORCE Algorithm: Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229-256.
  • Asynchronous Advantage Actor-Critic (A3C/A2C): Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937). PMLR.
  • Proximal Policy Optimization (PPO): Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • NeuroEvolution of Augmenting Topologies (NEAT): Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99-127.

Tetris AI Resources

Course Materials

  • Lectures, provided code examples, and guidance from the Aktuelle Data Science Entwicklungen II - Reinforcement Learning course at DHBW Mannheim, instructed by Janina Patzer.

Core Libraries & Documentation
