This repository contains Python implementations of 31 different text summarization models, ranging from basic statistical approaches to advanced transformer-based architectures. This project serves as an educational resource for understanding various text summarization techniques and their practical applications.
Text summarization is the process of generating a shorter version of a longer text while preserving its most important information. This repository demonstrates both extractive (selecting key sentences) and abstractive (generating new content) summarization approaches.
All models are evaluated using:
- ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)
  - Precision: how much of the generated summary overlaps with the reference (relative to the generated summary's length)
  - Recall: how much of the reference summary is captured by the generated one (relative to the reference's length)
  - F1-Score: harmonic mean of precision and recall
- BLEU Score (Bilingual Evaluation Understudy)
  - Measures n-gram overlap between the generated text and the reference text
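As a rough illustration of these metrics (not the exact evaluation script used in the notebooks), the scores can be computed for any model's output with the `rouge` and `nltk` packages listed under the dependencies; the reference and generated strings below are placeholders:

```python
# Minimal sketch: compute ROUGE and BLEU for one generated summary.
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat sat on the mat and slept all afternoon."
generated = "The cat slept on the mat all afternoon."

# ROUGE: precision, recall, and F1 over n-gram / longest-common-subsequence overlap
scores = Rouge().get_scores(generated, reference)[0]
print("ROUGE-1 F1:", scores["rouge-1"]["f"])
print("ROUGE-L F1:", scores["rouge-l"]["f"])

# BLEU: n-gram precision of the generated text against the reference
smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference.split()], generated.split(), smoothing_function=smooth)
print("BLEU:", bleu)
```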
# Python 3.7+ required
python --version
# Install core dependencies
pip install -r requirements.txt
# Example: BERT-based extractive summarization
# (Summarizer is provided by the bert-extractive-summarizer package)
from summarizer import Summarizer
model = Summarizer()
text = "Your long text here..."
summary = model(text, num_sentences=3)
print(summary)
# Navigate to the model directory
cd "Basic to Advance Text Summarisation Models"
# Run any Jupyter notebook
jupyter notebook BERT.ipynb
Model | Type | ROUGE F1 | BLEU | Pros | Cons | Best For |
---|---|---|---|---|---|---|
GPT-2 | Abstractive | 0.797 | 0.423 | High recall, contextual | Low control, expensive | Research, complex docs |
LSA | Extractive | 0.602 | 0.869 | Semantic relationships | Needs large data | Multi-document |
TextRank | Extractive | 0.586 | 0.694 | Unbiased, consistent | Limited context | General purpose |
T5 | Abstractive | 0.573 | 0.683 | High accuracy, multilingual | Resource intensive | Production systems |
NLTK | Extractive | 0.589 | 0.869 | Simple, language-independent | Limited accuracy | Beginners |
BERT | Extractive | 0.545 | 0.677 | Context understanding | Extractive only | Context-heavy docs |
BART | Abstractive | 0.402 | 0.905 | State-of-the-art | Large model size | High-quality summaries |
LexRank | Extractive | 0.357 | 0.651 | Handles complex texts | Limited coverage | Long documents |
SumBasic | Extractive | 0.589 | 0.621 | Simple implementation | Limited coverage | Single documents |
These models use fundamental NLP techniques and statistical approaches:
- Approach: Frequency-based sentence scoring using NLTK library
- Pros: Simple implementation, language-independent, well-documented
- Cons: Limited accuracy for complex texts, resource-intensive
- Performance: ROUGE F1: 0.589, BLEU: 0.869
- Use Case: Educational purposes, simple documents
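A minimal sketch of such a frequency-based extractive summarizer, assuming NLTK's `punkt` and `stopwords` resources are available; the function name and scoring details are illustrative and may differ from the notebook implementation:

```python
# Frequency-based extractive summarization with NLTK (illustrative sketch).
import heapq
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def frequency_summarize(text, num_sentences=3):
    stop_words = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalnum() and w.lower() not in stop_words]
    freq = nltk.FreqDist(words)
    max_freq = max(freq.values()) if freq else 1

    # Score each sentence by the normalized frequencies of its content words
    scores = {}
    for sent in sent_tokenize(text):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                scores[sent] = scores.get(sent, 0) + freq[w] / max_freq

    # Keep the highest-scoring sentences, in their original order
    top = set(heapq.nlargest(num_sentences, scores, key=scores.get))
    return " ".join(s for s in sent_tokenize(text) if s in top)
```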
- Approach: Term Frequency-Inverse Document Frequency weighting
- Pros: Effective for identifying important terms, widely used
- Cons: May miss context, struggles with technical jargon
- Performance: Evaluated using standard TF-IDF scoring
- Use Case: Keyword extraction, document classification
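One plausible way to turn TF-IDF weights into an extractive summary, sketched with scikit-learn's `TfidfVectorizer`; the function name and the per-sentence scoring rule are assumptions, not necessarily the notebook's exact method:

```python
# TF-IDF sentence scoring: treat each sentence as a "document" so IDF
# down-weights words that appear in every sentence.
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summarize(text, num_sentences=3):
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()  # total TF-IDF weight per sentence
    top = np.argsort(scores)[-num_sentences:]
    return " ".join(sentences[i] for i in sorted(top))
```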
- Approach: Probabilistic sentence selection based on word frequency
- Pros: Simple implementation, good for single documents
- Cons: Limited coverage, may lack coherence
- Performance: ROUGE F1: 0.589, BLEU: 0.621
- Use Case: Single document summarization
These models represent text as graphs and use centrality algorithms:
- Approach: Graph-based ranking using sentence similarity
- Pros: Automatic, unbiased, consistent results
- Cons: Limited context understanding, may miss nuances
- Performance: ROUGE F1: 0.586, BLEU: 0.694
- Use Case: General purpose summarization
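A compact sketch of the TextRank idea, assuming a TF-IDF cosine-similarity graph ranked with `networkx.pagerank`; the notebook may construct the sentence graph differently:

```python
# TextRank sketch: build a sentence-similarity graph, rank nodes with PageRank.
import numpy as np
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summarize(text, num_sentences=3):
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(tfidf)                 # pairwise sentence similarity
    np.fill_diagonal(sim, 0)                       # ignore self-similarity
    ranks = nx.pagerank(nx.from_numpy_array(sim))  # centrality score per sentence
    top = sorted(ranks, key=ranks.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```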
- Approach: Graph-based lexical centrality using cosine similarity
- Pros: Handles complex texts, good performance on benchmarks
- Cons: Limited coverage, extractive only
- Performance: ROUGE F1: 0.357, BLEU: 0.651
- Use Case: Long documents, multi-document summarization
- Approach: Graph theory-based sentence selection
- Pros: Theoretical foundation, systematic approach
- Cons: Computational complexity, may not capture semantics
- Use Case: Research applications
These models use mathematical techniques to identify important content:
- Approach: Singular Value Decomposition for semantic relationships
- Pros: Captures semantic relationships, good for multi-document
- Cons: Requires large training data, limited coverage
- Performance: ROUGE F1: 0.602, BLEU: 0.869
- Use Case: Multi-document summarization, semantic analysis
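A small LSA-style sketch using scikit-learn's `TruncatedSVD` over a sentence-term matrix; selecting the strongest sentence per latent topic is one common heuristic and is assumed here, not taken from the notebook:

```python
# LSA sketch: SVD over the TF-IDF sentence-term matrix, then pick the
# sentence with the highest loading on each latent topic.
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summarize(text, num_sentences=3):
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    k = max(1, min(num_sentences, len(sentences) - 1))
    topic_weights = TruncatedSVD(n_components=k, random_state=0).fit_transform(tfidf)
    chosen = {int(np.argmax(np.abs(topic_weights[:, j]))) for j in range(topic_weights.shape[1])}
    return " ".join(sentences[i] for i in sorted(chosen))
```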
- Approach: Matrix-based sentence importance scoring
- Pros: Systematic ranking, handles large documents
- Cons: May miss context, computational expense
- Use Case: Large document processing
These models group similar content together:
- Approach: K-means clustering of sentences
- Pros: Groups similar content, systematic approach
- Cons: Requires parameter tuning, may lose context
- Use Case: Topic-based summarization
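An illustrative K-means summarizer: cluster TF-IDF sentence vectors and keep the sentence nearest each centroid. Sentence-embedding variants are also common; TF-IDF is used here only to keep the sketch self-contained:

```python
# K-means clustering sketch: one representative sentence per cluster.
from nltk.tokenize import sent_tokenize
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def kmeans_summarize(text, num_sentences=3):
    sentences = sent_tokenize(text)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences).toarray()
    k = min(num_sentences, len(sentences))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    # The sentence closest to each centroid represents that cluster
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    return " ".join(sentences[i] for i in sorted(set(closest)))
```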
- Approach: Hierarchical clustering for document structure
- Pros: Preserves document structure, systematic
- Cons: Computational complexity, parameter sensitivity
- Use Case: Structured document summarization
These models use sophisticated NLP techniques:
- Approach: Classic statistical approach using word frequency
- Pros: Historical significance, simple implementation
- Cons: Basic approach, limited effectiveness
- Use Case: Educational purposes, historical context
- Approach: Sentence compression and selection
- Pros: More concise summaries, preserves key information
- Cons: Complex implementation, may lose context
- Use Case: Length-constrained summaries
These are state-of-the-art deep learning models:
- Approach: Pre-trained BERT model for sentence extraction
- Pros: High accuracy, captures context well
- Cons: Limited to extractive, depends on pre-trained models
- Performance: ROUGE F1: 0.545, BLEU: 0.677
- Use Case: Context-heavy documents
- Approach: Bidirectional and Auto-Regressive Transformer
- Pros: State-of-the-art performance, high accuracy
- Cons: Resource-intensive, large model size
- Performance: ROUGE F1: 0.402, BLEU: 0.905
- Use Case: High-quality abstractive summaries
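A minimal way to run BART through the Hugging Face `pipeline` API; `facebook/bart-large-cnn` is one commonly used checkpoint and is assumed here, the notebook may load a different one:

```python
# Abstractive summarization with BART via the transformers pipeline.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = "Your long text here..."
summary = summarizer(text, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```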
- Approach: Text-To-Text Transfer Transformer
- Pros: High accuracy, customizable, multilingual
- Cons: Resource-intensive, technical complexity
- Performance: ROUGE F1: 0.573, BLEU: 0.683
- Use Case: Production systems, multilingual applications
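A short T5 sketch: since T5 is a text-to-text model, the task is signalled with a `summarize:` prefix. The `t5-small` checkpoint is assumed here for speed; larger checkpoints work the same way, and the tokenizer needs the `sentencepiece` package listed in the dependencies:

```python
# T5 summarization sketch with the "summarize:" task prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("summarize: " + "Your long text here...",
                   return_tensors="pt", max_length=512, truncation=True)
ids = model.generate(inputs["input_ids"], max_length=60, num_beams=4, early_stopping=True)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```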
- Approach: Generative Pre-trained Transformer 2
- Pros: Large language model, contextual understanding
- Cons: Lack of control, high computational costs
- Performance: ROUGE F1: 0.797, BLEU: 0.423
- Use Case: Research, creative summarization
- Approach: Distilled version of BART for efficiency
- Pros: Faster inference, smaller model size
- Cons: Slightly lower accuracy than full BART
- Use Case: Real-time applications
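The same `pipeline` call shown for BART works here by swapping in a distilled checkpoint (the exact checkpoint used in the notebook is an assumption):

```python
# DistilBART: identical pipeline usage, smaller/faster checkpoint
# (assumed: sshleifer/distilbart-cnn-12-6).
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
print(summarizer("Your long text here...", max_length=130, min_length=30)[0]["summary_text"])
```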
- Approach: Topic modeling and document similarity
- Pros: Good for topic-based summarization
- Cons: May miss specific details
- Use Case: Topic extraction, document similarity
- Approach: Multi-document summarization toolkit
- Pros: Specialized for multi-document scenarios
- Cons: Limited to specific use cases
- Use Case: Multi-document summarization
- Approach: Custom transformer-based pipeline
- Pros: Customizable architecture, modern approach
- Cons: Complex implementation, requires expertise
- Use Case: Custom applications, research
The repository includes 12 different Pegasus model implementations:
- Approach: Single document abstractive summarization
- Pros: High-quality summaries, generalization capability
- Cons: Large computational requirements, limited control
- Performance: ROUGE F1: 0.086, BLEU: 0
- Use Case: Single document abstractive summarization
- Approach: Multi-document summarization
- Pros: Handles multiple documents, comprehensive coverage
- Cons: Complex processing, resource-intensive
- Use Case: Multi-document scenarios
Includes implementations of:
- Pegasus-XSUM: BBC News summarization
- Pegasus-CNN/DailyMail: News article summarization
- Pegasus-Newsroom: Newsroom dataset
- Pegasus-MultiNews: Multi-document news summarization
- Pegasus-Gigaword: Headline generation
- Pegasus-WikiHow: How-to article summarization
- Pegasus-Reddit-TIFU: Reddit story summarization
- Pegasus-BigPatent: Patent document summarization
- And more specialized variants
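All of these variants share the same loading pattern; only the checkpoint name changes. A sketch with `google/pegasus-xsum` (names such as `google/pegasus-cnn_dailymail` or `google/pegasus-multi_news` follow the same scheme; the exact checkpoints used in the notebooks may differ):

```python
# Pegasus sketch: load one variant and generate an abstractive summary.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

batch = tokenizer(["Your long text here..."],
                  truncation=True, padding="longest", return_tensors="pt")
ids = model.generate(**batch)
print(tokenizer.batch_decode(ids, skip_special_tokens=True)[0])
```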
- Approach: Novel graph-based approach using LCS, TF-IDF, and matrix methods
- Pros: Combines multiple techniques, systematic approach
- Cons: Complex implementation, computational expense
- Performance: ROUGE F1: 0.417, BLEU: 0.901
- Use Case: Research, novel approaches
- Best Overall Performance: GPT-2 achieved the highest ROUGE F1 score (0.797)
- Best BLEU Score: BART achieved the highest BLEU score (0.905)
- Most Balanced: T5 shows good balance between ROUGE and BLEU scores
- Efficiency vs Accuracy: DistilBART offers a good trade-off for real-time applications
- Transformer models generally outperform traditional methods
- Abstractive models show higher BLEU scores but lower ROUGE scores
- Extractive models are more consistent but less creative
- Graph-based methods provide good baseline performance
The performance comparison table above provides detailed metrics for all models; the visualizations in this section complement those figures from the "Model Performance Comparison Table" section.
pip install transformers
pip install torch
pip install nltk
pip install rouge
pip install scikit-learn
pip install numpy
pip install pandas
pip install networkx
pip install sentencepiece
- spacy: For advanced NLP processing
- gensim: For topic modeling
- sumy: For multi-document summarization
- bert-extractive-summarizer: For BERT-based summarization
- RAM: Minimum 8GB, Recommended 16GB+
- GPU: Optional but recommended for transformer models
- Storage: At least 5GB free space for models
- Python: 3.7 or higher
This repository serves as a comprehensive learning resource for:
- Understanding Different Approaches: From basic statistical methods to advanced transformer models
- Performance Comparison: Real-world evaluation using standard metrics
- Implementation Examples: Complete working code for each approach
- Research References: Academic papers and research background for each model
- Practical Applications: Real-world use cases and considerations
- Start with NLTK and TF-IDF models
- Understand basic NLP concepts
- Learn about evaluation metrics
- Explore graph-based methods (TextRank, LexRank)
- Study matrix decomposition (LSA)
- Understand clustering approaches
- Dive into transformer models (BERT, BART, T5)
- Experiment with Pegasus models
- Implement custom approaches
- NLTK: Educational purposes, basic understanding
- TF-IDF: Keyword extraction, document classification
- SumBasic: Single document summarization
- BERT: Context-heavy documents
- BART: High-quality abstractive summaries
- T5: Production systems, multilingual
- LSA: Semantic analysis, multi-document
- LexRank: Long documents, systematic approach
- Multi-Document Pegasus: Advanced multi-document scenarios
- DistilBART: Fast inference, good quality
- TextRank: General purpose, consistent
- TF-IDF: Simple, fast processing
- GPT-2: Creative summarization, research
- Advanced transformer models: State-of-the-art approaches
- Graph-Based Summary: Novel methodologies
This project includes:
- 31 different summarization approaches
- Comprehensive performance evaluation
- Educational implementations
- Research paper references
- Practical considerations and trade-offs
- Extractive vs Abstractive Summarization
- Statistical vs Neural Approaches
- Graph-based vs Matrix-based Methods
- Traditional NLP vs Deep Learning
- Single vs Multi-document Summarization
# If you encounter CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cpu
# For memory issues with large models
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
- Ensure sufficient RAM for large models
- Use CPU versions for memory-constrained environments
- Check model file paths and permissions
- Use smaller models for faster inference
- Consider batch processing for multiple documents
- Implement caching for repeated summarizations
Contributions are welcome! Areas for improvement:
- Additional model implementations
- Performance optimizations
- Documentation improvements
- New evaluation metrics
- Domain-specific adaptations
- Fork the repository
- Create a feature branch
- Add your implementation with proper documentation
- Include performance metrics
- Submit a pull request
This repository was created as part of an Academic Capstone Project at Vellore Institute of Technology. Special thanks to Prof. Durgesh Kumar and the faculty members for their guidance and support.
- Hugging Face for transformer model implementations
- NLTK team for natural language processing tools
- Academic community for research papers and methodologies
This project is for educational purposes. Please respect the licenses of individual model implementations and research papers referenced.
- Educational Use: Free for educational and research purposes
- Commercial Use: Check individual model licenses
- Attribution: Please cite the original research papers
For questions, issues, or contributions:
- Issues: Use GitHub issues for bug reports
- Discussions: Use GitHub discussions for questions
- Contributions: Submit pull requests for improvements
Note: This repository is designed for educational purposes and research. For production use, please ensure proper licensing and consider the specific requirements of your use case.
- Last Updated: December 2024
- Total Models: 31
- Python Version: 3.7+
- License: Educational Use