This repository implements a neural network-based classifier that detects propaganda in text. Sentences are represented with pre-trained GloVe word embeddings and classified by a Multi-Layer Perceptron (MLP), which predicts whether a given sentence is "propaganda" or "non-propaganda".
- Pre-trained GloVe embeddings for semantic word representations.
- MLP classifier with dropout regularization, implemented in PyTorch.
- Pickle-based model serialization for easy inference.
- Simple and efficient sentence vectorization using GloVe.
- CLI-based inference for predicting individual sentences.
- PyTorch: Deep learning framework used for building and training the MLP classifier.
- Gensim: For loading and utilizing pre-trained GloVe word embeddings.
- NLTK: Tokenizer for breaking sentences into words.
- NumPy: Efficient numerical computations for vector operations.
- Scikit-learn: For evaluating the model (precision, recall, F1-score).
- Pandas: Data manipulation during preprocessing.
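
As an illustration of the evaluation metrics named above, the following minimal sketch computes precision, recall, and F1 for the `propaganda` class in plain Python. The function name and the toy labels are hypothetical and not taken from the repository; in the project itself these numbers would presumably come from scikit-learn (for example `sklearn.metrics.precision_recall_fscore_support`).

```python
def precision_recall_f1(y_true, y_pred, positive="propaganda"):
    """Compute precision, recall and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = ["propaganda", "propaganda", "non-propaganda", "non-propaganda"]
y_pred = ["propaganda", "non-propaganda", "propaganda", "non-propaganda"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With one true positive, one false positive, and one false negative, all three metrics come out at 0.5 for this toy input.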
- GloVe (Global Vectors for Word Representation):
  - Version: `glove.6B.300d.txt`.
  - Provides 300-dimensional dense vector representations for English words.
- Textual data provided in `train.tsv`.
  - Contains sentences, article titles, and their corresponding labels (`propaganda` or `non-propaganda`).
- Pickle:
  - Saves the trained model weights, input parameters, and OOV vector for efficient inference.
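
The pickle-based serialization described above might look roughly like the sketch below. The bundle layout, the key names (`state_dict`, `params`, `oov_vector`), the hidden size of 128, and the helper names are all assumptions made for illustration; the actual contents of `propoganda.pickle` may be organized differently.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical bundle layout; key names and sizes are assumptions.
bundle = {
    # Stand-in for the trained model weights.
    "state_dict": {"hidden.weight": [[0.1, -0.2], [0.3, 0.4]]},
    # Input parameters needed to rebuild the model at inference time.
    "params": {"input_dim": 300, "hidden_dim": 128, "num_classes": 2},
    # Vector used for out-of-vocabulary (OOV) words.
    "oov_vector": [0.0] * 300,
}


def save_bundle(obj, path):
    """Serialize the bundle to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_bundle(path):
    """Load a previously saved bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip through a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"
    save_bundle(bundle, path)
    restored = load_bundle(path)
```

Keeping the OOV vector and the input parameters next to the weights means inference needs only this one file plus the GloVe embeddings. Because unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.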
git clone https://github.com/ash-sha/propaganda-detection.git
cd propaganda-detection
Ensure Python 3.8+ is installed. Then install required libraries:
pip install -r requirements.txt
- Download `glove.6B.300d.txt` from the GloVe website.
- Place the file in the appropriate directory (e.g., `./glove.6B.300d.txt`).
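
To show what the downloaded file contains, here is a minimal sketch that parses the GloVe plain-text format by hand: each line holds a token followed by its space-separated vector components. The project lists Gensim for loading the embeddings, so this manual parser is a stand-in for illustration, and `load_glove_text` is a hypothetical helper name.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse GloVe's text format: one token per line, followed by its floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token = parts[0]
        values = [float(x) for x in parts[1:]]
        vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors


# Tiny 3-dimensional stand-in for the real 300-dimensional file.
sample = io.StringIO("the 0.1 0.2 0.3\nnews 0.4 0.5 0.6\n")
vectors = load_glove_text(sample)

# With the real file, usage would be along these lines:
# with open("./glove.6B.300d.txt", encoding="utf-8") as f:
#     vectors = load_glove_text(f)
```

With Gensim 4.0 or later, the same file can likely be loaded via `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`, which avoids holding a Python dictionary of arrays.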
(Optional) To retrain the model from scratch, run the `train.ipynb` notebook.
- Ensure the model file (`propoganda.pickle`) is present in the working directory.
- Run the inference script: `test.ipynb`
- Enter your sentence when prompted, and get the prediction:
Enter the query: The government is spreading fake news to mislead the public.
Predicted label: propaganda
propaganda-detection/
│
├── glove.6B.300d.txt # Pre-trained GloVe embeddings
├── train.tsv # Dataset used for training
├── propoganda.pickle # Serialized trained model
├── train.ipynb # Script for training the model
├── test.ipynb # Script for running inference
├── requirements.txt # Python dependencies
├── README.md # Project documentation
- Data Preprocessing:
  - Load sentences and labels from the dataset.
  - Shuffle and split data into training, validation, and testing sets.
  - Normalize and vectorize sentences using GloVe embeddings.
- Model Architecture:
  - Multi-Layer Perceptron (MLP) with one hidden layer.
  - Dropout regularization to prevent overfitting.
- Training:
  - Optimize using the Adam optimizer and CrossEntropy loss.
  - Evaluate on the validation set after each epoch.
  - Save the model with the highest validation F1-score.
- Inference:
  - Load the trained model (`propoganda.pickle`).
  - Use GloVe embeddings to vectorize the input query.
  - Predict the class label for the input sentence.
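
The architecture and training steps above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: the class name `MLPClassifier`, the hidden size of 128, the ReLU activation, the dropout rate of 0.5, the learning rate, and the random stand-in batch are all assumptions.

```python
import torch
from torch import nn


class MLPClassifier(nn.Module):
    """One-hidden-layer MLP with dropout (hyperparameters are assumptions)."""

    def __init__(self, input_dim=300, hidden_dim=128, num_classes=2, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Returns raw logits; CrossEntropyLoss applies log-softmax internally.
        return self.net(x)


torch.manual_seed(0)
model = MLPClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random stand-in batch: 8 sentence vectors of size 300 with binary labels.
features = torch.randn(8, 300)
labels = torch.randint(0, 2, (8,))

# One training step.
model.train()
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

# Evaluation mode disables dropout before predicting.
model.eval()
with torch.no_grad():
    logits = model(features)
    predictions = logits.argmax(dim=1)
```

In a full training loop, the validation F1-score would be computed after each epoch and the model saved only when that score improves on the best value seen so far.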
- Input Query: The government is spreading fake news to mislead the public.
- Predicted Label: `propaganda`
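
Before a query like this one reaches the classifier, it has to become a fixed-size vector. The sketch below assumes mean pooling over word vectors with a fallback vector for out-of-vocabulary words; the README does not state the exact pooling method, so treat this as one plausible reading. The `vectorize` helper, the whitespace tokenizer (standing in for the NLTK tokenizer), and the 3-dimensional toy embeddings are hypothetical.

```python
import numpy as np


def vectorize(sentence, vectors, oov_vector):
    """Average the word vectors of a sentence, using oov_vector for unknown words."""
    tokens = sentence.lower().split()  # stand-in for the NLTK tokenizer
    rows = [vectors.get(token, oov_vector) for token in tokens]
    if not rows:
        # Empty input: fall back to the OOV vector.
        return oov_vector.copy()
    return np.mean(rows, axis=0)


# Toy 3-dimensional embeddings; the real ones are 300-dimensional.
vectors = {
    "fake": np.array([1.0, 0.0, 0.0]),
    "news": np.array([0.0, 1.0, 0.0]),
}
oov_vector = np.zeros(3)

# "everywhere" is not in the toy vocabulary, so it maps to the OOV vector.
sentence_vector = vectorize("Fake news everywhere", vectors, oov_vector)
```

The resulting vector has the same dimensionality as the embeddings regardless of sentence length, which is what lets a fixed-input MLP consume it.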
- Add support for multi-class propaganda classification (e.g., detecting specific propaganda techniques).
- Improve sentence vectorization by integrating contextual embeddings (e.g., BERT, RoBERTa).
- Implement a web or mobile interface for real-time predictions.
Contributions are welcome! Please fork the repository and submit a pull request.
This project is licensed under the MIT License. See `LICENSE` for details.