Fake news is a major issue in digital journalism and social media. The goal of this project is to automatically classify news articles as "Fake" or "Real" using Natural Language Processing (NLP) and Machine Learning/Deep Learning models.
- Dataset Source: Kaggle – Fake and Real News Dataset
- Total Rows: ~44,000 articles
- Columns: `title`, `text`, `subject`, `date`, `label`
- Merged and labeled `True.csv` and `Fake.csv` (see the loading/cleaning sketch after this list)
- Cleaned HTML, URLs, punctuation, and stopwords
- Analyzed text length, word count, and named entities (NER)
- Visualized class balance, word clouds, and NER distribution
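A minimal loading-and-cleaning sketch is below. The file names `True.csv`/`Fake.csv` come from the dataset itself, but the label encoding (1 = real, 0 = fake) and the exact cleaning regexes are illustrative assumptions, not the project's exact code.

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of NLTK's stopword list
STOPWORDS = set(stopwords.words("english"))

# Assumed label scheme: 1 = real, 0 = fake.
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")
true_df["label"] = 1
fake_df["label"] = 0
df = pd.concat([true_df, fake_df], ignore_index=True)

def clean_text(text: str) -> str:
    """Lowercase, then strip HTML tags, URLs, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df["clean_text"] = df["text"].astype(str).apply(clean_text)
```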
- TF-IDF Vectorization (see the sketch after this list)
- Word2Vec Embeddings
- GloVe 100d Pretrained Embeddings (for Deep Learning)
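A sketch of the TF-IDF step, assuming the `clean_text` column and labels from the preprocessing sketch above. The `max_features` and `ngram_range` values are illustrative defaults, not the project's tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split before vectorizing so the vectorizer never sees test data.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)

tfidf = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)  # fit on training text only
X_test_tfidf = tfidf.transform(X_test)
```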
- Logistic Regression (with GridSearchCV; sketched after this list)
- Naïve Bayes
- Support Vector Machine (SVM)
- Random Forest
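A sketch of the Logistic Regression + GridSearchCV setup on the TF-IDF features above; the parameter grid and scoring metric are hypothetical choices, not the project's exact grid.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10]}  # hypothetical search grid
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid, cv=5, scoring="f1", n_jobs=-1,
)
grid.fit(X_train_tfidf, y_train)
print(grid.best_params_, grid.best_score_)
clf = grid.best_estimator_  # used for evaluation below
```

The other classical models (Naïve Bayes, SVM, Random Forest) drop into the same pattern by swapping the estimator and grid.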
- Bidirectional LSTM (Keras sketch after this list) with:
- Pretrained GloVe embeddings (100d)
- Two stacked BiLSTM layers
- Dropout & L2 regularization
- EarlyStopping and ReduceLROnPlateau
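A sketch of the BiLSTM pipeline under stated assumptions: the vocabulary size, sequence length, layer widths, dropout rate, and the `glove.6B.100d.txt` path are illustrative, not the project's exact configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS, MAX_LEN, EMB_DIM = 20_000, 300, 100  # illustrative sizes

# Tokenize and pad (fit on training text only).
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=MAX_LEN)
X_test_pad = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=MAX_LEN)

# Build an embedding matrix from the GloVe file (path is an assumption).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

emb_matrix = np.zeros((MAX_WORDS, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS and word in glove:
        emb_matrix[i] = glove[word]

model = models.Sequential([
    layers.Embedding(
        MAX_WORDS, EMB_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(emb_matrix),
        trainable=False,  # keep the pretrained GloVe vectors frozen
    ),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True,
                                     kernel_regularizer=regularizers.l2(1e-4))),
    layers.Bidirectional(layers.LSTM(32,
                                     kernel_regularizer=regularizers.l2(1e-4))),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    EarlyStopping(patience=3, restore_best_weights=True),
    ReduceLROnPlateau(factor=0.5, patience=2),
]
model.fit(X_train_pad, np.asarray(y_train), validation_split=0.1,
          epochs=10, batch_size=64, callbacks=callbacks)
```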
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix
- Cross-validation (for classical models)
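A sketch of the evaluation step for a fitted classical model `clf` (e.g. the grid-search winner above); the metric choices mirror the list above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score

y_pred = clf.predict(X_test_tfidf)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Cross-validation (classical models only).
scores = cross_val_score(clf, X_train_tfidf, y_train, cv=5, scoring="f1")
print(f"5-fold CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```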
| Model | Accuracy |
|---|---|
| Logistic Regression | 98.7% |
| Naïve Bayes | 93.3% |
| SVM | 99.4% |
| Random Forest | 99.8% |
| BiLSTM (GloVe) | 99.9% ✅ |
⚠️ Note: Models were trained on padded, tokenized text with a proper validation strategy.
- Languages: Python
- Libraries: NumPy, pandas, scikit-learn, TensorFlow/Keras, NLTK, Matplotlib, Seaborn
- Embeddings: GloVe (100d), Word2Vec
- EDA Tools: spaCy, WordCloud
- Version Control: Git, GitHub
Can be deployed via:
- Flask API for local predictions
- Streamlit web app for UI
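A minimal Flask sketch, assuming the fitted `tfidf` vectorizer, `clf` model, and `clean_text` helper from the sketches above are available at startup (in practice they would be persisted with `joblib` after training and loaded here).

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # `tfidf`, `clf`, and `clean_text` are assumed to be loaded at startup.
    text = request.get_json(force=True).get("text", "")
    features = tfidf.transform([clean_text(text)])
    pred = int(clf.predict(features)[0])
    return jsonify({"label": "Real" if pred == 1 else "Fake"})

if __name__ == "__main__":
    app.run(port=5000)
```

Example request: `curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"text": "Some article text"}'`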
- Integrate BERT or DistilBERT for transformer-based classification
- Add SHAP/LIME for explainability
- Self-learning (active learning loop)
- Streamlit UI for production-level deployment
- Deploy on Render / HuggingFace Spaces / GCP
- GloVe: Global Vectors for Word Representation
- Kaggle Dataset
- Inspired by the need for factual integrity in online journalism.