This project implements and evaluates various Reinforcement Learning (RL) and Evolutionary Algorithm (EA) agents designed to play the classic game of Tetris. The primary goal is to explore different algorithmic approaches to solving this iconic game, comparing their learning efficiency, performance, and emergent strategies.
This project was developed as part of the Aktuelle Data Science Entwicklungen II - Reinforcement Learning course at DHBW Mannheim.
Name | Student ID | Image |
---|---|---|
German Paul | 9973152 | ![]() |
Michael Greif | 5658606 | ![]() |
- Python 3.13
- uv (Python package installer and virtual environment manager): https://github.com/astral-sh/uv
- A C++ compiler (required by PyTorch for certain installations)
The main interface to interact with the agents (train, test, evaluate) is `main.py`:
uv run main.py
This will launch an interactive command-line menu allowing you to:
- Train specific agents (DQN, REINFORCE, A2C, PPO, Genetic Algorithm, Evolutionary Strategies).
- Test a pre-trained agent and generate a video of its gameplay.
- Evaluate all trained agents and generate a performance summary.
The project also includes dedicated training scripts for different categories of agents:
- `train_dqn_reinforce.py`: For DQN and REINFORCE agents.
- `train_onpolicy.py`: For A2C and PPO agents.
- `train_evolutionary.py`: For the Genetic Algorithm and Evolutionary Strategies.
These can be run directly if preferred, e.g.:
uv run train_dqn_reinforce.py --agent_type dqn
uv run train_evolutionary.py --agent_type genetic
Use the `--help` flag for more options on these scripts (e.g., `uv run train_dqn_reinforce.py --help`).
The project is organized as follows (up to level 3):
.
├── .git/ # Git internal files (hidden)
├── .venv/ # Virtual environment (if created here)
├── agents/
│ ├── __init__.py # Makes 'agents' a package and defines AGENT_REGISTRY
│ ├── a2c_agent.py # A2C agent implementation
│ ├── base_agent.py # Abstract base class for agents
│ ├── dqn_agent.py # DQN agent implementation
│ ├── es_agent.py # Evolutionary Strategies agent implementation
│ ├── genetic_agent.py # Genetic Algorithm agent implementation
│ ├── ppo_agent.py # PPO agent
│ ├── random_agent.py # Random agent
│ └── reinforce_agent.py # REINFORCE agent implementation
├── models/ # Directory for saved agent models
│ └── gifs/ # Subdirectory for videos generated by test.py
├── src/
│ └── tetris.py # Core Tetris game logic
├── .gitignore
├── .python-version # pyenv version file (if used)
├── config.py # Global configuration for hyperparameters, paths, etc.
├── evaluate.py # Script to evaluate all trained agents
├── LICENSE
├── main.py # Main interactive launcher script
├── pyproject.toml # Project metadata and dependencies for uv/Poetry/PEP517
├── README.md # This file
├── requirements.txt # Project dependencies (for pip/uv)
├── test.py # Script to test a single trained agent and record video
├── train_dqn_reinforce.py # Training script for DQN and REINFORCE
├── train_evolutionary.py # Training script for GA and ES
├── train_onpolicy.py # Training script for A2C and PPO
└── uv.lock # uv lock file
Key Components:
- `main.py`: The main entry point for interacting with the project via a CLI menu.
- `agents/`: Contains all the agent implementations.
  - `base_agent.py`: Defines an abstract `BaseAgent` class that all other agents inherit from, ensuring a consistent interface for `select_action`, `learn`, `save`, `load`, etc.
  - Individual agent files (`dqn_agent.py`, etc.): Implement the specific logic for each algorithm.
- `src/tetris.py`: The core Tetris game environment. It handles game rules, piece movements, line clearing, and scoring. It also provides a method `get_next_states()`, which is crucial for how the agents make decisions by evaluating potential outcomes of piece placements.
- `config.py`: Centralized configuration for hyperparameters, model paths, training game counts, etc., for all agents.
- Training scripts (`train_*.py`): Dedicated scripts for training different categories of agents. They handle the main training loop, calling the agent's `select_action` and `learn` methods, and managing game episodes.
- `evaluate.py`: Loads the best saved model for each agent type and runs a set number of games to gather performance statistics, outputting a summary table and a CSV file.
- `test.py`: Loads a specific trained agent and records a video of its gameplay for a few games, focusing on the best-scoring game.
- `models/`: Default directory where trained models are saved (usually with the score in the filename) and where test videos are stored.
This project implements several types of RL and EA agents to play Tetris. The Tetris environment is unique in that the "action" is to choose one of many possible placements for the current falling piece. The agents evaluate the board state after a hypothetical placement to make their decisions.
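To make this concrete, here is a minimal sketch of how an agent can pick a placement by scoring the candidate next states returned by `get_next_states()`. It assumes the method returns a mapping from placement actions to feature vectors and that `value_net` is any PyTorch module that maps board features to a single score; the names are illustrative, not the project's actual classes.

```python
import torch

def select_best_placement(env, value_net):
    """Illustrative greedy selection over candidate placements.

    Assumes env.get_next_states() returns {action: feature_vector, ...},
    where each feature vector describes the board *after* that placement.
    """
    next_states = env.get_next_states()
    actions = list(next_states.keys())
    features = torch.stack(
        [torch.as_tensor(next_states[a], dtype=torch.float32) for a in actions]
    )
    with torch.no_grad():
        scores = value_net(features).squeeze(-1)  # one score per candidate placement
    return actions[int(torch.argmax(scores))]
```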
- Type: Value-based, Off-policy Reinforcement Learning.
- Network: Uses a Q-Network (typically a Multi-Layer Perceptron - MLP - in this project, as it takes handcrafted features as input) to estimate the Q-value (expected future reward) of a given board state after a piece has been placed.
- Action Selection: During training, it uses epsilon-greedy exploration: with probability epsilon, it chooses a random placement; otherwise, it queries the Q-network for all possible next states (resulting from valid piece placements) and chooses the placement leading to the state with the highest Q-value. During testing, epsilon is set to 0 (greedy selection).
- Learning (see the sketch after this list):
  - Uses a Replay Buffer to store experiences: `(S_t_features, S'_chosen_features, R_{t+1}, S_{t+1}_features, Done_{t+1})`.
    - `S_t_features`: Board features before placing the current piece.
    - `S'_chosen_features`: Board features after the chosen piece placement. This is the input to the Q-network to get the "current Q-value".
    - `R_{t+1}`: Reward from the placement.
    - `S_{t+1}_features`: Board features when the next piece appears. This is used to calculate the target Q-value.
    - `Done_{t+1}`: If the game ended after `S_{t+1}`.
  - A separate Target Network is used to stabilize learning. The target Q-value is calculated as `R_{t+1} + gamma * max_a' Q_target(features_of_state_after_a'_from_S_{t+1})`.
  - The loss is typically the Mean Squared Error (MSE) between the current Q-value and the target Q-value.
- Training Script: `train_dqn_reinforce.py` (shared with REINFORCE, but their learning loops are distinct).
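As a rough illustration of the update described above, the following sketch performs one replay-buffer learning step. It is not the project's `dqn_agent.py` code: the batch layout is assumed, and the target simply scores the `S_{t+1}` features with the target network, a common simplification of the max-over-next-placements target noted later in the future-work section.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One replay-buffer update (sketch).

    `batch` is assumed to be a tuple of tensors: (features after the chosen
    placement, reward, features when the next piece appears, done flag).
    """
    s_chosen, reward, s_next, done = batch
    q_current = q_net(s_chosen).squeeze(-1)          # Q-value of the placement actually taken
    with torch.no_grad():
        q_next = target_net(s_next).squeeze(-1)      # value estimate of the next board state
        target = reward + gamma * (1.0 - done) * q_next
    loss = F.mse_loss(q_current, target)             # MSE between current and target Q-values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```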
- Type: Policy Gradient, On-policy Reinforcement Learning.
- Network: Uses a Policy Network (MLP) that takes features of a board state after a piece has been placed and outputs a single score (logit).
- Action Selection: For all possible next states S' (resulting from valid piece placements), the policy network computes their scores. A softmax is applied over these scores to create a probability distribution. An action (piece placement) is then sampled from this distribution.
- Learning (see the sketch after this list):
  - Collects trajectories of `(log_prob_of_chosen_S', reward)` for an entire episode (one full game of Tetris).
  - At the end of the episode, it calculates the discounted future return `G_t` for each step.
  - The policy network is updated to increase the probability of actions that led to higher returns. The loss is typically `-sum(log_prob_chosen_S' * G_t)`.
- Training Script: `train_dqn_reinforce.py`.
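A minimal sketch of that end-of-episode update, assuming `log_probs` is the list of log-probabilities of the chosen placements and `rewards` the per-placement rewards collected during one game (the return normalisation is a common optional extra, not something specified here):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """End-of-episode REINFORCE update (sketch).

    `log_probs`: list of 0-d tensors, log-probabilities of the chosen placements.
    `rewards`: list of floats, one reward per placement.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):                      # discounted returns, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalising returns is a common (optional) variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(log_probs) * returns).sum() # -sum(log_prob * G_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```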
- Type: Policy Gradient & Value-based, On-policy Reinforcement Learning.
- Networks: Uses two networks (often with shared initial layers):
- Actor: Takes features of a board state after a piece placement (S') and outputs a score/logit. Similar to REINFORCE, a softmax over scores for all possible S' gives action probabilities.
- Critic: Takes features of the board state before a piece placement (S_t) and outputs an estimate of the value of that state, V(S_t).
- Action Selection: The Actor network determines the probability distribution over possible piece placements (derived from scores of S' states), and an action is sampled.
- Learning (see the sketch after this list):
  - Learns at each step (after each piece placement).
  - Critic Update: The critic learns to better predict `V(S_t)` using the TD target `R_{t+1} + gamma * V(S_{t+1})`. The loss is the MSE between `V(S_t)` and this target.
  - Actor Update: The actor is updated using the advantage `A_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)`. The loss encourages actions that lead to a positive advantage, typically `-log_prob_chosen_S' * A_t`. An entropy bonus can be added to encourage exploration.
- Training Script: `train_onpolicy.py` (shared with PPO).
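A per-placement update in the spirit of the description above, assuming separate actor/critic networks and optimizers (the project's networks may share layers); `log_prob` is the log-probability of the chosen placement and `entropy` an optional entropy term:

```python
import torch
import torch.nn.functional as F

def a2c_step(actor_opt, critic_opt, critic, log_prob, s_t, s_next,
             reward, done, gamma=0.99, entropy=None, beta=0.01):
    """One per-placement A2C update (sketch).

    `s_t`, `s_next`: feature tensors before the placement and when the next
    piece appears; `critic` maps such features to V(s).
    """
    v_t = critic(s_t).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = reward + gamma * (1.0 - done) * v_next
    advantage = (td_target - v_t).detach()           # A_t = TD target - V(S_t)

    critic_loss = F.mse_loss(v_t, td_target)         # critic regresses on the TD target
    actor_loss = -log_prob * advantage               # reinforce positive-advantage actions
    if entropy is not None:
        actor_loss = actor_loss - beta * entropy     # optional exploration bonus

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```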
- Type: Policy Gradient, On-policy Reinforcement Learning. It's an improvement over A2C, often more stable and sample-efficient.
- Networks: Similar to A2C, it uses an Actor and a Critic network.
- Actor: Evaluates features of S' (state after potential action).
- Critic: Evaluates features of S_t (state before action).
- Action Selection: The Actor network defines a policy (distribution over S' scores), and an action is sampled.
- Learning:
- Collects a batch of experiences (a "horizon" of piece placements).
- Calculates advantages using Generalized Advantage Estimation (GAE), which helps balance bias and variance.
- Performs multiple epochs of updates on the collected batch of data.
- The key PPO innovation is the "clipped surrogate objective function," which restricts how much the policy can change in one update, leading to more stable training. It compares the ratio of new policy probabilities to old policy probabilities.
- Training Script: `train_onpolicy.py`.
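Two small sketches of the pieces named above: GAE over a collected horizon, and the clipped surrogate loss. Both assume plain lists / 1-D tensors and illustrative parameter names, not the project's actual `ppo_agent.py` interface.

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one collected horizon (sketch).

    `rewards`, `values`, `dones` are equal-length lists; `last_value` is the
    critic's estimate for the state after the horizon (0 if the game ended).
    """
    advantages, gae, next_value = [], 0.0, last_value
    for r, v, d in zip(reversed(rewards), reversed(values), reversed(dones)):
        delta = r + gamma * next_value * (1.0 - d) - v   # TD error at this step
        gae = delta + gamma * lam * (1.0 - d) * gae      # exponentially weighted sum
        advantages.insert(0, gae)
        next_value = v
    return torch.tensor(advantages)

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: limits how far the policy can move per update."""
    ratio = torch.exp(new_log_probs - old_log_probs)     # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # maximise the pessimistic bound
```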
- Type: Evolutionary Algorithm.
- Individuals: Each individual in the population is a `PolicyNetwork` (an MLP that scores features of S'). Its "genes" are the network weights.
- Fitness: The fitness of an individual is its average score obtained over one or more full games of Tetris.
- Evolution Process (`GeneticAlgorithmController`; see the sketch after this list):
  - Selection: Parents are chosen from the current population, typically using tournament selection (fitter individuals are more likely to be chosen).
  - Crossover: Genetic material (network weights) from two parents is combined to create offspring (e.g., by averaging weights or single-point crossover).
  - Mutation: Small random changes are introduced into the offspring's weights to maintain diversity.
  - Elitism: A few of the best individuals from the current generation are carried over directly to the next generation, unchanged.
  - This process is repeated for a set number of generations.
- Usage for Evaluation/Testing: The `GeneticAgent` class acts as a wrapper that loads the best `PolicyNetwork` found by the `GeneticAlgorithmController` during training and uses it to select actions.
- Training Script: `train_evolutionary.py`.
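The operators above could look roughly like the following (a sketch over PyTorch modules; the actual `GeneticAlgorithmController` may use different selection, crossover, or mutation details):

```python
import copy
import random
import torch

def tournament_select(population, fitnesses, k=3):
    """Return the fittest of k randomly drawn individuals (tournament selection)."""
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def crossover(parent_a, parent_b):
    """Average-weights crossover: the child's parameters are the mean of both parents'."""
    child = copy.deepcopy(parent_a)
    with torch.no_grad():
        for c, a, b in zip(child.parameters(), parent_a.parameters(), parent_b.parameters()):
            c.copy_((a + b) / 2.0)
    return child

def mutate(individual, rate=0.1, scale=0.05):
    """Perturb a random fraction of the weights with Gaussian noise to keep diversity."""
    with torch.no_grad():
        for p in individual.parameters():
            mask = (torch.rand_like(p) < rate).float()
            p.add_(mask * torch.randn_like(p) * scale)
```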
- Type: Evolutionary Algorithm (a type of black-box optimization).
- Core Idea: ES directly evolves the parameters (weights) of a single "central" policy network.
- Network: Uses a `PolicyNetworkES` (an MLP that scores features of S').
- Evolution Process (see the sketch after this list):
  - Start with a central set of weights for the policy network.
  - In each generation, create a population of "perturbed" versions of these central weights by adding Gaussian noise.
  - Evaluate the fitness (game score) of each set of perturbed weights by running Tetris games.
  - Update the central weights by moving them in the direction of the "gradient" estimated from the fitness scores of the perturbed individuals. Essentially, weights that led to higher scores get "reinforced": the update is a weighted sum of the noise vectors, where the weights are derived from the fitness scores.
  - The learning rate and noise standard deviation (sigma) are key hyperparameters.
- Usage for Evaluation/Testing: The `ESAgent` class uses its current best central policy network (updated through evolution) for action selection.
- Training Script: `train_evolutionary.py`.
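A sketch of one ES generation over a flattened weight vector, assuming `fitness_fn` loads the given weights into the policy network, plays one or more Tetris games, and returns the score (the names are illustrative, not the project's `PolicyNetworkES` interface):

```python
import numpy as np

def es_generation(central_weights, fitness_fn, pop_size=50, sigma=0.1, lr=0.01):
    """One ES generation (sketch) over a flat weight vector."""
    noise = np.random.randn(pop_size, central_weights.size)            # Gaussian perturbations
    fitness = np.array([fitness_fn(central_weights + sigma * n) for n in noise])
    # Normalise fitness so the update is a fitness-weighted sum of noise directions.
    advantage = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    gradient_estimate = noise.T @ advantage / (pop_size * sigma)
    return central_weights + lr * gradient_estimate                     # move the centre "uphill"
```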
- Type: Baseline / Control.
- Action Selection: Chooses a random valid piece placement from the available next states.
- Learning: Does not learn.
- Purpose: Provides a baseline performance to compare other agents against.
To get a visual sense of how our agents perform, we've recorded gameplay GIFs. These videos highlight the strategies (or lack thereof) learned by each agent.
- Random Agent Gameplay:
- DQN Agent Gameplay:
- Genetic Algorithm (Best Individual) Gameplay:
- REINFORCE Agent Gameplay:
- A2C Agent Gameplay:
  - The improved A2C agent now demonstrates competent and strategic gameplay. It successfully clears lines and builds stacks with clear intent, achieving respectable scores. Its performance is much more in line with the PPO agent, showcasing a significant improvement in its learned policy and an ability to sustain longer games.
- PPO Agent Gameplay:
- Evolutionary Strategies (Best Policy) Gameplay:
We evaluated our agents over 20 games after their respective training routines. The following table summarizes their performance.
Quantitative Evaluation Results:
Agent | Avg Score | Std Score | Min Score | Max Score | Avg Pieces | Avg Tetrominoes | Avg Lines Cleared | Num Eval Games |
---|---|---|---|---|---|---|---|---|
RANDOM | 18.05 | 3.34 | 13 | 25 | 18.75 | 18.75 | 0.00 | 20 |
DQN | 1686055.75 | 1918342.52 | 92544 | 8728554 | 256904.55 | 256904.55 | 102748.95 | 20 |
GENETIC | 2123031.05 | 1813144.83 | 145855 | 5885656 | 344067.05 | 344067.05 | 137613.90 | 20 |
REINFORCE | 22707.15 | 17638.53 | 1588 | 51922 | 3644.65 | 3644.65 | 1444.55 | 20 |
A2C | 23580.20 | 16033.70 | 3695 | 58220 | 2453.10 | 2453.10 | 967.55 | 20 |
PPO | 34448.35 | 31026.29 | 790 | 125177 | 3361.35 | 3361.35 | 1332.50 | 20 |
ES | 1256402.85 | 1251838.17 | 39100 | 4977175 | 202922.95 | 202922.95 | 81156.30 | 20 |
Analysis of Our New Results:
- Our Top Performers:
- The Genetic Algorithm (GENETIC) emerged as our leading agent by a significant margin. It achieved the highest average score, maximum score, and lines cleared, which tells us its direct policy optimization approach is working exceptionally well with our current setup.
- Our Evolutionary Strategies (ES) agent also performed extremely well, securing the second-highest average score and showing strong learning.
- We're pleased to see our Deep Q-Network (DQN) has shown a massive improvement from our earlier evaluations. It's now achieving very high scores, comparable to ES. This strongly suggests that the fixes we implemented for model loading and ensuring proper evaluation mode for DQN were successful.
- Mid-Tier Performers (Showing Encouraging Learning):
- Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C) and REINFORCE are now demonstrating significant learning, with average scores in the tens of thousands. They are substantially outperforming our random agent baseline, though they still have some way to go to catch up with GENETIC, ES, and DQN.
- Variability in Performance: We observed that many of our successful agents (GENETIC, ES, DQN) show high standard deviations in their scores. This indicates a wide range of performance across the evaluation games, which isn't entirely unexpected for a complex game like Tetris.
- Impact of Model Loading & Evaluation Mode:
- The dramatic improvement in both DQN's and A2C's performance after our review highlights the importance of applying rigorous model loading and evaluation mode procedures consistently across all agents. This was a key factor in achieving reliable and representative results.
- Effectiveness of Different Approaches:
- It seems direct policy search methods (GENETIC, ES) and value-based methods when properly evaluated (DQN) are proving very effective for us.
  - Our policy gradient methods (REINFORCE, A2C, PPO) are having more moderate success: they clearly learn, but remain well below the top tier. This could be due to their known sensitivity to hyperparameters, our choice of baseline, the current feature set, or the network architecture we've used.
Our Next Steps & Recommendations:
- Optimize Policy Gradient Agents: While PPO, A2C, and REINFORCE are learning effectively, there's a clear performance gap to our top-tier agents. Our priority is to bridge this gap through further hyperparameter tuning (learning rates, entropy coefficients, GAE parameters) and potentially exploring more complex network architectures for these agents.
- Ensure Consistency in Evaluation: We will continue to ensure all agents are evaluated under identical conditions and that our evaluation scripts robustly load the best available model for each agent, using our score-in-filename convention.
- Feature Engineering & Network Architecture (Longer-Term): For agents that aren't yet at the top tier, we might revisit our state representation (the current 4 heuristic features) and the complexity of our neural network architectures in the future. However, it's encouraging that GENETIC, ES, and DQN have demonstrated that high scores are achievable with the current setup.
- Improve RL Agent Learning:
- Correct Target Calculation for DQN: Implement the more accurate target Q-value calculation for DQN that involves simulating next possible moves from S_{t+1} using the target network. This requires careful handling of game state reconstruction or storing more information in the replay buffer.
  - Refine A2C/PPO Network Inputs: Ensure the Actor network for A2C/PPO correctly models `pi(Action | S_t)` by taking `S_t` features (before piece placement) and outputting a distribution over the discrete placements derived from `get_next_states()`. The Critic should take `S_t` features and output `V(S_t)`. This is a significant architectural change from the current model of scoring S' states.
  - Hyperparameter Tuning: Extensive tuning of learning rates, discount factors, network architectures, exploration schedules (for DQN), entropy coefficients (for A2C/PPO), etc.
- Reward Shaping: Experiment with different reward functions beyond just lines cleared or game score to guide learning more effectively (e.g., penalties for holes, height).
- Advanced State Features: Explore more sophisticated handcrafted features or consider using a raw pixel input with a Convolutional Neural Network (CNN) for the RL agents, which might capture more nuances of the board state.
- Enhanced Evolutionary Algorithms:
- More sophisticated crossover and mutation operators for GA.
- Adaptive sigma/learning rates for ES.
- Comparative Analysis: More rigorous comparison of learning curves, final performance, and sample efficiency across all agents.
- Code Refinements: Move shared helper functions (like model saving/loading utilities) into a common `utils.py` file.
- Deep Q-Network (DQN): Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
- REINFORCE Algorithm: Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3), 229-256.
- Asynchronous Advantage Actor-Critic (A3C/A2C): Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). PMLR.
- Proximal Policy Optimization (PPO): Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- NeuroEvolution of Augmenting Topologies (NEAT): Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2), 99-127.
- vietnh1009 Tetris Game Implementation: An implementation of Tetris which inspired this project (often found at vietnh1009/Tetris-deep-Q-learning-pytorch).
- Lectures, provided code examples, and guidance from the Aktuelle Data Science Entwicklungen II - Reinforcement Learning course at DHBW Mannheim, instructed by Janina Patzer.
- PyTorch: Deep learning framework used for neural network implementations.
- Documentation: https://pytorch.org/docs/stable/index.html
- NumPy: Fundamental package for numerical computation in Python.
- Documentation: https://numpy.org/doc/stable/
- OpenCV (cv2): Library for computer vision, used here for rendering the Tetris game and video generation. The Python bindings are commonly referred to as `cv2`.
- Documentation: https://docs.opencv.org/
- Matplotlib: Plotting library, potentially used for visualizing training progress or results.
- Documentation: https://matplotlib.org/stable/contents.html
- NEAT (neat-python): Library for implementing the NEAT algorithm.
- Documentation: https://neat-python.readthedocs.io/en/latest/