299 changes: 299 additions & 0 deletions ml-api/README_TOXICITY.md
@@ -0,0 +1,299 @@
# SentiLog ML API - Toxicity Detection

This document describes the toxicity detection feature integrated into the SentiLog ML API.

## 🎯 Overview

The system automatically analyzes news articles for toxicity levels using Google's Perspective API, in addition to the existing sentiment analysis. Each news article is scored for toxicity and stored with additional metadata for filtering and analysis.

## 🏗️ Architecture

### Services

1. **`toxicity_service.py`** - Core toxicity detection using Perspective API
2. **`perspective_service.py`** - Extended Perspective API wrapper with configurable attributes
3. **`news_service.py`** - Enhanced to include toxicity scoring during news fetching
4. **`db_service.py`** - MongoDB integration for storing toxicity data
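
As a rough illustration of what a Perspective call involves, the sketch below builds a request for the `TOXICITY` attribute and reads the summary score from the response. It follows the public Perspective API request and response shape; `build_request_body`, `parse_toxicity_score`, and `score_text` are hypothetical names, and the actual code in `toxicity_service.py` may differ.

```python
import json
import urllib.parse
import urllib.request

# Public Perspective endpoint; the API key is passed as a query parameter.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request_body(text: str) -> dict:
    """Build the JSON body asking Perspective for the TOXICITY attribute only."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def parse_toxicity_score(response_json: dict) -> float:
    """Pull the summary TOXICITY probability (0.0 to 1.0) out of a response."""
    return response_json["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_text(text: str, api_key: str, timeout: float = 6.0) -> float:
    """Hypothetical helper: POST the text to Perspective and return its score."""
    url = PERSPECTIVE_URL + "?" + urllib.parse.urlencode({"key": api_key})
    request = urllib.request.Request(
        url,
        data=json.dumps(build_request_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_toxicity_score(json.loads(response.read().decode("utf-8")))
```

Keeping request building and response parsing as pure functions makes them testable without network access.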

### Data Flow

```
News API → fetch_news() → Sentiment Analysis (VADER) → Toxicity Analysis (Perspective) → MongoDB Storage
```
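
The flow above can be sketched as a chain of per-article steps. This is a minimal sketch, not the project's code: `enrich_article`, `run_pipeline`, the injected callables, and the list standing in for MongoDB are hypothetical, and scoring the title plus description is an assumption about which text is analyzed.

```python
from typing import Callable, Dict, Iterable, List


def enrich_article(
    article: Dict,
    sentiment_fn: Callable[[str], str],
    toxicity_fn: Callable[[str], Dict],
) -> Dict:
    """Attach sentiment and toxicity to one article, in pipeline order."""
    # Assumption: title and description are the text that gets scored.
    text = " ".join(
        part for part in (article.get("title"), article.get("description")) if part
    )
    enriched = dict(article)  # copy so the caller's dict is left untouched
    enriched["sentiment"] = sentiment_fn(text)  # VADER in the real pipeline
    enriched["toxicity"] = toxicity_fn(text)    # Perspective in the real pipeline
    return enriched


def run_pipeline(
    articles: Iterable[Dict],
    sentiment_fn: Callable[[str], str],
    toxicity_fn: Callable[[str], Dict],
    store: List[Dict],
) -> int:
    """Enrich every article and append it to `store` (a stand-in for MongoDB)."""
    for article in articles:
        store.append(enrich_article(article, sentiment_fn, toxicity_fn))
    return len(store)
```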

## 🔧 Configuration

### Required Environment Variables

```bash
# News API (required)
NEWS_API_KEY=your_news_api_key_here

# Google Perspective API (required)
PERSPECTIVE_API_KEY=your_perspective_api_key_here

# Optional Configuration
TOXICITY_THRESHOLD=0.7 # Threshold for marking content as toxic
PERSPECTIVE_TIMEOUT=6 # API request timeout in seconds
MONGO_URI=mongodb://localhost:27017 # MongoDB connection string
MONGO_DB=sentilog # Database name
NEWS_COLLECTION=news # Collection name
```
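
A minimal sketch of reading these variables in Python follows. `ToxicityConfig` and `load_config` are hypothetical names, and treating the values shown above as defaults for the optional variables is an assumption (only the 0.7 threshold is stated as a default in this document).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ToxicityConfig:
    news_api_key: str
    perspective_api_key: str
    toxicity_threshold: float
    perspective_timeout: float
    mongo_uri: str
    mongo_db: str
    news_collection: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ToxicityConfig:
    """Read settings from `env` (defaults to os.environ), failing fast on missing keys."""
    env = os.environ if env is None else env
    missing = [k for k in ("NEWS_API_KEY", "PERSPECTIVE_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return ToxicityConfig(
        news_api_key=env["NEWS_API_KEY"],
        perspective_api_key=env["PERSPECTIVE_API_KEY"],
        # Assumed defaults, mirroring the example values in the block above.
        toxicity_threshold=float(env.get("TOXICITY_THRESHOLD", "0.7")),
        perspective_timeout=float(env.get("PERSPECTIVE_TIMEOUT", "6")),
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        mongo_db=env.get("MONGO_DB", "sentilog"),
        news_collection=env.get("NEWS_COLLECTION", "news"),
    )
```

Passing a plain dict as `env` keeps the loader easy to test without touching the real environment.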

### Getting API Keys

1. **News API Key**:
   - Visit [NewsAPI.org](https://newsapi.org/)
   - Sign up for a free account
   - Copy your API key

2. **Perspective API Key**:
   - Visit [Google Cloud Console](https://console.cloud.google.com/)
   - Enable the Perspective Comment Analyzer API
   - Create credentials (API key)
   - Copy your API key

## 📊 Data Structure

### News Article Schema

```json
{
  "_id": "ObjectId(...)",
  "title": "Article title",
  "description": "Article description",
  "content": "Full article content",
  "url": "https://example.com/article",
  "source": "Source name",
  "publishedAt": "2025-01-15T10:30:00Z",
  "sentiment": "Positive",
  "toxicity": {
    "score": 0.15,
    "toxic": false,
    "threshold": 0.7
  },
  "ingested_at": 1705320600
}
```

### Toxicity Object Fields

- **`score`** (float): Toxicity score from 0.0 (not toxic) to 1.0 (very toxic)
- **`toxic`** (boolean): Whether the score meets or exceeds the threshold (`score >= threshold`)
- **`threshold`** (float): The threshold used for classification
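
A minimal sketch of how the three fields relate, assuming the `score >= threshold` rule stated under "Default Threshold"; `build_toxicity` is a hypothetical helper name.

```python
def build_toxicity(score: float, threshold: float = 0.7) -> dict:
    """Return the toxicity object stored with each article."""
    return {
        "score": score,
        "toxic": score >= threshold,  # a score exactly at the threshold counts as toxic
        "threshold": threshold,
    }
```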

## 🚀 API Endpoints

### 1. Fetch and Store News with Toxicity Analysis

```http
POST /ml-api/fetch-and-store
Content-Type: application/json

{
  "query": "technology",
  "page_size": 20,
  "max_pages": 2,
  "from_date": "2025-01-01"
}
```

**Response:**
```json
{
  "status": "done",
  "inserted": 35,
  "updated": 5,
  "errors": []
}
```
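
One way the `inserted` and `updated` counts could arise is an upsert keyed on a stable identifier. The sketch below uses an in-memory dict in place of MongoDB; keying on `url`, the error message text, and the name `store_articles` are all assumptions, not the project's confirmed behavior.

```python
from typing import Dict, Iterable


def store_articles(articles: Iterable[Dict], collection: Dict[str, Dict]) -> Dict:
    """Upsert articles into `collection` and return a summary like the response above."""
    summary = {"status": "done", "inserted": 0, "updated": 0, "errors": []}
    for article in articles:
        url = article.get("url")
        if not url:
            # Assumption: articles without a URL cannot be de-duplicated, so skip them.
            summary["errors"].append("article without url skipped")
            continue
        if url in collection:
            summary["updated"] += 1
        else:
            summary["inserted"] += 1
        collection[url] = article
    return summary
```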

### 2. Get News with Toxicity Filtering

```http
GET /ml-api/news?limit=10&toxic=true
```

**Query Parameters:**
- `limit` (int): Number of articles to return (default: 20)
- `toxic` (string): Filter by toxicity level ("true" for toxic, "false" for non-toxic)

**Response:**
```json
{
  "count": 10,
  "results": [
    {
      "_id": "...",
      "title": "Example Article",
      "toxicity": {
        "score": 0.85,
        "toxic": true,
        "threshold": 0.7
      },
      ...
    }
  ]
}
```
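
The query parameters could map onto a MongoDB filter roughly as follows. This is a sketch under assumptions: `build_news_query` is a hypothetical helper, the filter uses MongoDB dot notation on the stored `toxicity.toxic` field, and rejecting invalid values with `ValueError` is one possible policy rather than the service's documented behavior.

```python
from typing import Optional, Tuple


def build_news_query(
    toxic: Optional[str] = None, limit: Optional[str] = None
) -> Tuple[dict, int]:
    """Translate raw query-string values into (mongo_filter, limit)."""
    mongo_filter: dict = {}
    if toxic is not None:
        value = toxic.strip().lower()
        if value not in ("true", "false"):
            raise ValueError("toxic must be 'true' or 'false'")
        # Dot notation matches the nested field inside the toxicity object.
        mongo_filter["toxicity.toxic"] = value == "true"
    count = 20 if limit is None else int(limit)  # 20 is the documented default
    if count < 1:
        raise ValueError("limit must be a positive integer")
    return mongo_filter, count
```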

### 3. Standalone Toxicity Analysis

```http
POST /ml-api/analyze-toxicity
Content-Type: application/json

{
  "text": "Text to analyze for toxicity"
}
```

**Response:**
```json
{
  "score": 0.25,
  "toxic": false,
  "threshold": 0.7
}
```
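
A small Python client for this endpoint might look like the sketch below. It uses only the standard library; `build_analyze_request` and `analyze_toxicity` are hypothetical names, and the base URL is whatever host and port the server runs on (the curl examples in this document use `http://localhost:5001`).

```python
import json
import urllib.request


def build_analyze_request(base_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for /ml-api/analyze-toxicity without sending it."""
    url = base_url.rstrip("/") + "/ml-api/analyze-toxicity"
    return urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_toxicity(base_url: str, text: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_analyze_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```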

## 🧪 Testing

### Run the Test Suite

```bash
python setup_and_test.py
```

This will test:
- Dependencies installation
- Environment variables
- Database connection
- Toxicity detection services
- News fetching pipeline
- API endpoints (if server is running)

### Manual Testing

1. **Start the server:**
   ```bash
   python run.py
   ```

2. **Test toxicity analysis:**
   ```bash
   curl -X POST http://localhost:5001/ml-api/analyze-toxicity \
     -H "Content-Type: application/json" \
     -d '{"text": "This is a test message"}'
   ```

3. **Fetch and analyze news:**
   ```bash
   curl -X POST http://localhost:5001/ml-api/fetch-and-store \
     -H "Content-Type: application/json" \
     -d '{"query": "technology", "page_size": 5}'
   ```

4. **View results** (the URL is quoted so the shell does not interpret `?`):
   ```bash
   curl "http://localhost:5001/ml-api/news?limit=5"
   ```

## 🔍 Understanding Toxicity Scores

### Score Interpretation

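These bands are rules of thumb, not Perspective API definitions; the score is the estimated probability that a reader would perceive the text as toxic.

- **0.0 - 0.2**: Very low toxicity (likely safe)
- **0.2 - 0.4**: Low toxicity (mostly safe)
- **0.4 - 0.6**: Moderate toxicity (review recommended)
- **0.6 - 0.8**: High toxicity (likely problematic)
- **0.8 - 1.0**: Very high toxicity (very likely problematic)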

### Default Threshold

The default threshold is set to **0.7**, meaning content with a score ≥ 0.7 is marked as toxic.
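
The bands and the default threshold can be expressed as two small helpers. This is a sketch: `toxicity_band` and `is_toxic` are hypothetical names, and assigning a score that sits exactly on a shared boundary (such as 0.2) to the higher band is an assumption, since the listed ranges overlap at their edges.

```python
def toxicity_band(score: float) -> str:
    """Map a 0.0 to 1.0 score onto the interpretation bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Upper bounds are exclusive, so a boundary value falls into the higher band.
    bands = [(0.2, "very low"), (0.4, "low"), (0.6, "moderate"), (0.8, "high")]
    for upper, label in bands:
        if score < upper:
            return label
    return "very high"


def is_toxic(score: float, threshold: float = 0.7) -> bool:
    """Apply the documented rule: a score at or above the threshold is toxic."""
    return score >= threshold
```

With the default threshold, a score in the "high" band is flagged as toxic only when it is at least 0.7.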

## 🚨 Error Handling

The system handles various error scenarios:

1. **API Rate Limits**: Includes sleep delays between requests
2. **Network Errors**: Graceful fallback with error logging
3. **Invalid Content**: Empty or null text returns score 0.0
4. **API Key Issues**: Clear error messages for missing credentials

### Error Response Example

```json
{
  "toxicity": {
    "error": "Perspective API request failed: 403 Forbidden"
  }
}
```
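
The fallback behavior described above could be wrapped around any scoring function along these lines. This is a sketch, not the service's implementation: `safe_toxicity` is a hypothetical name, the scorer is injected so the example runs offline, and catching every `Exception` is a simplification.

```python
from typing import Callable, Optional


def safe_toxicity(
    text: Optional[str],
    scorer: Callable[[str], float],
    threshold: float = 0.7,
) -> dict:
    """Score `text`, returning 0.0 for empty input and an error object on failure."""
    if text is None or not text.strip():
        # Empty or null text is treated as non-toxic without calling the API.
        return {"score": 0.0, "toxic": False, "threshold": threshold}
    try:
        score = scorer(text)
    except Exception as exc:  # simplification: real code would catch narrower errors
        return {"error": f"Perspective API request failed: {exc}"}
    return {"score": score, "toxic": score >= threshold, "threshold": threshold}
```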

## 📈 Performance Considerations

1. **Rate Limiting**: The Perspective API has usage limits
2. **Caching**: Consider implementing caching for repeated content
3. **Batch Processing**: Process articles in batches to manage API quotas
4. **Timeout Settings**: Configured timeouts prevent hanging requests
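
Caching and request spacing can be combined in one wrapper around the scoring function. The sketch below is illustrative only: `CachedThrottledScorer` is a hypothetical class, the one-second spacing is an assumed value to be tuned against your actual quota, and the cache is an unbounded in-memory dict.

```python
import hashlib
import time
from typing import Callable, Dict, Optional


class CachedThrottledScorer:
    """Cache scores by text hash and space out calls to the underlying scorer."""

    def __init__(
        self,
        scorer: Callable[[str], float],
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._scorer = scorer
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, float] = {}
        self._last_call: Optional[float] = None

    def __call__(self, text: str) -> float:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]  # repeated content costs no API call
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)  # keep calls at least min_interval apart
        self._last_call = self._clock()
        score = self._scorer(text)
        self._cache[key] = score
        return score
```

Injecting `clock` and `sleep` keeps the wrapper testable without real delays.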

## 🔧 Customization

### Adding More Toxicity Attributes

The Perspective API supports multiple attributes. To analyze additional ones:

1. Modify `perspective_service.py`:
   ```python
   attributes = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK", "INSULT"]
   ```

2. Update the database schema to store additional scores
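
As a sketch of both steps, the helpers below request several attributes at once and flatten the response into a dict that could be stored next to the existing `toxicity` object. The function names are hypothetical, the response shape follows the public Perspective API (`attributeScores.<NAME>.summaryScore.value`), and the storage layout is an assumption.

```python
from typing import Dict, Iterable

ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK", "INSULT"]


def build_requested_attributes(attributes: Iterable[str]) -> Dict[str, dict]:
    """Build the `requestedAttributes` section of a Perspective request."""
    return {name: {} for name in attributes}


def extract_scores(response_json: dict, attributes: Iterable[str]) -> Dict[str, float]:
    """Collect the summary score of each requested attribute present in the response."""
    scores = response_json.get("attributeScores", {})
    return {
        name: scores[name]["summaryScore"]["value"]
        for name in attributes
        if name in scores  # attributes missing from the response are skipped
    }
```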

### Adjusting Thresholds

Modify the `TOXICITY_THRESHOLD` environment variable or update the service logic for dynamic thresholds.

## 🐛 Troubleshooting

### Common Issues

1. **"PERSPECTIVE_API_KEY not set"**
   - Ensure the API key is set in your `.env` file
   - Verify the Perspective API is enabled in Google Cloud Console

2. **High toxicity scores on normal content**
   - Check the content being analyzed
   - Consider adjusting the threshold
   - Verify the API is working correctly

3. **Database connection errors**
   - Ensure MongoDB is running
   - Check the `MONGO_URI` configuration
   - Verify database permissions

4. **News API quota exceeded**
   - Check your NewsAPI usage limits
   - Consider using a paid plan for higher limits
   - Implement request throttling

## 🔄 Future Enhancements

Potential improvements to consider:

1. **Multi-language Support**: Extend toxicity analysis to other languages
2. **Historical Analysis**: Track toxicity trends over time
3. **Content Moderation**: Automatically flag or filter toxic content
4. **Dashboard**: Web interface for toxicity analytics
5. **Machine Learning**: Train custom models for domain-specific toxicity

## 📞 Support

For issues with the toxicity detection functionality:

1. Run the test suite: `python setup_and_test.py`
2. Check the error logs in the console
3. Verify API keys and environment configuration
4. Review the Google Perspective API documentation
66 changes: 54 additions & 12 deletions ml-api/app/__init__.py
@@ -1,22 +1,64 @@
-from flask import Flask
+from flask import Flask, jsonify
 from flask_cors import CORS
 from dotenv import load_dotenv
+from pymongo import MongoClient
 import os
 
 load_dotenv()
-def create_app():
-    """Simple Flask app factory without config classes"""
+
+def create_app(config_object=None):
+    """Application factory that configures Flask app and routes."""
     app = Flask(__name__)
 
-    # Set config directly
-    app.config['SECRET_KEY'] = os.environ.get('SECRET_KEY', 'dev-secret-key-change-in-production')
-    app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max file size
-    cors_origins = os.environ.get('CORS_ORIGINS', '*').split(',')
+    app.config['SECRET_KEY'] = os.environ.get(
+        'SECRET_KEY',
+        'dev-secret-key-change-in-production'
+    )
+    app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
 
-    # Initialize CORS
+    cors_origins = os.environ.get('CORS_ORIGINS', '*').split(',')
     CORS(app, resources={r"/*": {"origins": cors_origins}})
 
-    # Register blueprints
-    from app.routes import routes
-    app.register_blueprint(routes.bp)
+    if config_object:
+        app.config.from_object(config_object)
+
+    from app.routes import routes as vader_routes
+    app.register_blueprint(vader_routes.bp)
+
+    try:
+        from app.routes import perspective_routes
+        app.register_blueprint(perspective_routes.bp)
+    except ImportError:
+
+        pass
+
+
+    mongo_client = MongoClient("mongodb://localhost:27017/")
+    db = mongo_client["news_db"]
+    collection = db["articles"]
+
+    @app.route("/news", methods=["GET"])
+    def get_news():
+        articles = list(collection.find().sort("_id", -1).limit(20))
+        for article in articles:
+            article["_id"] = str(article["_id"])
+        return jsonify(articles)
+    @app.route('/')
+    def root():
+        return {'message': 'SentiLog ML API is running — visit /ml-api/'}, 200
+
+    @app.errorhandler(400)
+    @app.errorhandler(404)
+    @app.errorhandler(500)
+    @app.errorhandler(502)
+    def custom_error_handler(error):
+        response = {
+            'error': {
+                'code': error.code,
+                'name': error.name,
+                'description': error.description,
+            }
+        }
+        return jsonify(response), error.code
+
-    return app
+    return app