This project focuses on Sentiment Analysis of IMDb movie reviews using Natural Language Processing (NLP) techniques. The goal is to classify movie reviews as either positive or negative based on their content.
The model is trained on the IMDb movie reviews dataset. It uses TF-IDF vectorization for feature extraction and Logistic Regression for classification, and it is evaluated with accuracy, precision, recall, and F1-score.
- Preprocessing: Text cleaning, tokenization, stopword removal, and lemmatization.
- Vectorization: TF-IDF (Term Frequency-Inverse Document Frequency) for feature extraction.
- Modeling: Logistic Regression for classification.
- Evaluation: Accuracy, precision, recall, and F1-score on the test set.
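
The steps listed above can be sketched end to end with scikit-learn. This is a minimal illustration, not the project's actual script: the toy reviews and labels are hypothetical, and the preprocessing is simplified to the lowercasing and built-in English stopword list of `TfidfVectorizer` rather than NLTK-based tokenization and lemmatization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; the real project trains on the IMDb dataset.
train_texts = [
    "A wonderful film with a great cast and a moving story",
    "Brilliant acting, I loved every minute",
    "Great direction and a wonderful script",
    "A terrible movie with awful acting",
    "Boring plot, I hated every minute",
    "Awful script and terrible direction",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Assumption: sklearn's built-in stopword list stands in for the NLTK
# preprocessing described above; lemmatization is omitted for brevity.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Classify two unseen (also hypothetical) reviews.
predictions = model.predict(
    ["wonderful story and great acting", "terrible and boring"]
)
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is fitted on the training data only and then reused unchanged at prediction time.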
- Python 3.8+
- Libraries:
  - `nltk` (for text preprocessing)
  - `sklearn` (for machine learning and model evaluation)
  - `pandas` (for data manipulation)
  - `matplotlib` (for visualization)
- Machine Learning Algorithms: Logistic Regression
This project uses the IMDb Movie Reviews dataset, which contains movie reviews labeled as either positive or negative.
- Training Data: Contains 12,500 positive reviews and 12,500 negative reviews.
- Testing Data: Similarly balanced with 12,500 positive and 12,500 negative reviews.
The data is processed using TF-IDF vectorization to transform text data into numerical features suitable for machine learning.
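
To make the TF-IDF transformation concrete, here is a small standard-library sketch. It follows the smoothed formula that scikit-learn documents for its default `TfidfVectorizer` settings, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. Whether the project's script uses those defaults is an assumption, and the `tfidf` helper and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        # Scale to unit length; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical mini-corpus.
docs = ["great movie great cast", "terrible movie", "great fun"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document, "cast" receives a higher weight than "movie" even though both occur once, because "movie" appears in more documents and is therefore less distinctive.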
To run this project locally, follow these steps:
- Clone the repository:

      git clone https://github.com/YousefAlaaAli/sentiment-analysis-imdb.git
      cd sentiment-analysis-imdb

- Create a virtual environment (optional but recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

- Install the required dependencies:

      pip install -r requirements.txt

To run the sentiment analysis:
- Make sure all dependencies are installed.
- Run the Python script to train and evaluate the model:

      python sentiment_analysis.py

After running the script, you will get the model's performance metrics (accuracy, precision, recall, and F1-score).
The model achieves the following performance on the test set:
- Accuracy: 87.88%
- Precision: 0.88 (for both positive and negative classes)
- Recall: 0.88 (for both positive and negative classes)
- F1-Score: 0.88 (for both positive and negative classes)
Because precision, recall, and F1-score are the same for both classes, the model performs about equally well on positive and negative reviews rather than favoring one class.
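
For reference, the metrics reported above can be computed from true and predicted labels as in the sketch below. The labels are hypothetical toy values, not the project's results, and `classification_metrics` is an illustrative helper. In practice `sklearn.metrics.classification_report` produces the same per-class figures.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

metrics_pos = classification_metrics(y_true, y_pred, positive=1)
metrics_neg = classification_metrics(y_true, y_pred, positive=0)
print(metrics_pos)
print(metrics_neg)
```

Calling the helper once per class, as here, mirrors the per-class reporting used in the results list.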
Contributions are welcome! Please fork the repository and submit a pull request if you have any improvements, fixes, or new features you'd like to add.
- Fork the repository
- Create a new branch (`git checkout -b feature-branch`)
- Make your changes
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature-branch`)
- Create a new Pull Request