🎧 Audio-Driven Hate Speech Detection in Telugu

Low-resource multimodal hate speech detection leveraging acoustic and textual representations for robust moderation in Telugu.


🚀 Overview

While hate speech detection has progressed rapidly for English, Telugu, a language with over 83 million speakers, still lacks annotated resources. This project introduces the first multimodal Telugu hate-speech dataset and a suite of audio-, text-, and fusion-based models for comprehensive detection.

🧠 Core Highlights

  • 🗂️ First Telugu hate-speech dataset (2 hours of annotated audio–text pairs).

  • 🔊 Multimodal pipeline integrating acoustic and textual cues.

  • ⚙️ Evaluated OpenSMILE, Wav2Vec2, LaBSE, and XLM-R baselines.

  • 🎯 Achieved 91% accuracy (audio) and 89% (text); fusion improved robustness.


🧩 Abstract

This study fills a critical resource gap in Telugu hate-speech detection. A manually annotated 2-hour multimodal dataset was curated from YouTube. Acoustic (OpenSMILE + SVM) and textual (LaBSE) models achieved 91% and 89% accuracy, respectively. Fusion approaches highlight the complementary roles of vocal prosody and linguistic cues.


🎯 Problem Statement

| Challenge | Description |
| --- | --- |
| 🗣️ Low-Resource Gap | Telugu lacks labeled corpora and pretrained models for hate-speech detection. |
| 🔊 Modality Gap | Text-only systems ignore vocal signals (tone, sarcasm, aggression). |

💡 Goal: Develop a multimodal framework combining speech and text for richer, context-aware classification.


📊 Dataset: DravLangGuard

| Attribute | Description |
| --- | --- |
| Source | YouTube (≥ 50 K subscribers) |
| Annotators | 3 native Telugu postgraduates |
| Classes | Hate / Non-Hate (4 sub-types for Hate) |
| Inter-Annotator Agreement | 0.79 (Cohen's κ) |
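For reference, Cohen's κ measures agreement beyond chance (0 = chance level, 1 = perfect agreement). A minimal sketch with scikit-learn, using hypothetical label vectors in place of the annotators' real judgments:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary judgments (1 = Hate) from two of the three annotators.
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 1]

print(cohen_kappa_score(annotator_a, annotator_b))  # kappa in [-1, 1]
```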

Dataset Composition

The dataset is balanced by duration between hate and non-hate content (roughly one hour each), with a detailed breakdown of hate-speech categories below.

| Class | Sub-Class | Short Label | No. of Samples | Total Duration (min) |
| --- | --- | --- | --- | --- |
| Hate (H) | Gender | G | 111 | 15.75 |
| Hate (H) | Religion | R | 82 | 15.49 |
| Hate (H) | Political / Nationality | P | 68 | 14.90 |
| Hate (H) | Personal Defamation | C | 133 | 14.90 |
| Non-Hate (NH) | Non-Hate | N | 208 | 60.00 |

🧰 Preprocessing Pipeline

🎵 Audio

  • Resampled (16 kHz / 48 kHz), loudness-normalized, duration-filtered.

  • Extracted OpenSMILE ComParE 2016 LLDs → statistical aggregates.

  • Augmentation: time-shift, time-stretch, Gaussian noise for class balancing (sketched below).
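A minimal sketch of this pipeline using librosa and the opensmile Python package; the 16 kHz target rate, peak normalization, and augmentation magnitudes are illustrative assumptions, and "clip.wav" is a placeholder path:

```python
import numpy as np
import librosa
import opensmile

def load_audio(path, target_sr=16_000):
    # Load and resample to the target rate; peak-normalize as a simple
    # stand-in for the pipeline's loudness normalization.
    y, _ = librosa.load(path, sr=target_sr)
    return y / (np.abs(y).max() + 1e-9)

def augment(y, sr, rng):
    # The three augmentations used for class balancing.
    shifted = np.roll(y, rng.integers(-sr // 2, sr // 2))   # time-shift (up to ±0.5 s)
    stretched = librosa.effects.time_stretch(y, rate=1.1)   # time-stretch
    noisy = y + 0.005 * rng.standard_normal(len(y))         # additive Gaussian noise
    return [shifted, stretched, noisy]

# ComParE 2016 functionals are the statistical aggregates of the LLDs.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
y = load_audio("clip.wav")
variants = augment(y, 16_000, np.random.default_rng(0))
features = smile.process_signal(y, 16_000)  # one row of 6,373 functionals
```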

✍️ Text

  • Tokenization & contextual embeddings (LaBSE, mBERT, XLM-R).

  • Sequence truncation (128–512 tokens).

  • Handles transliterated Telugu effectively via LaBSE's multilingual alignment (see the embedding sketch below).
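A minimal embedding sketch with Hugging Face Transformers, assuming the `sentence-transformers/LaBSE` checkpoint and the lower end (128) of the truncation range:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")

def embed(texts, max_length=128):
    # Tokenize with padding and truncation to the chosen sequence length.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # LaBSE's sentence vector is the L2-normalized pooled [CLS] output; its
    # multilingual alignment is what helps with transliterated Telugu.
    return torch.nn.functional.normalize(out.pooler_output, dim=-1)

# Native-script and romanized (transliterated) Telugu map to nearby vectors.
vectors = embed(["ఇది ఒక ఉదాహరణ వాక్యం", "idi oka udaharana vakyam"])
```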


🧮 Methodology

🔹 Unimodal Pipelines

| Modality | Methods |
| --- | --- |
| 🧾 Text | TF-IDF + XGBoost / SVM · Transformer fine-tuning (mBERT, XLM-R, LaBSE) |
| 🔊 Audio | OpenSMILE + SVM / RF / XGBoost / MLP · Wav2Vec2 (XLS-R, Indic) · AST · LSTM/1D-CNN sequence heads |
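As a concrete example of the classical text pipeline, a minimal TF-IDF + linear SVM baseline; the n-gram range and the placeholder transcripts/labels are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder transcripts and binary labels (1 = Hate) standing in for the dataset.
texts = ["transcript one ...", "transcript two ...",
         "transcript three ...", "transcript four ..."]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["a new transcript ..."]))
```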

🔸 Multimodal Fusion Strategies

| Strategy | Description | Implemented Models |
| --- | --- | --- |
| Early Fusion (Feature-Level) | Concatenate / project audio & text embeddings into a shared latent space before joint classification. | OpenSMILE + LaBSE (Attention), Wav2Vec2 + XLM-R, CLAP (joint audio–text encoders) |
| Late Fusion (Decision-Level) | Independent modality-specific classifiers; logits combined by averaging / voting (sketched below). | OpenSMILE + LaBSE, Wav2Vec2 + XLM-R |
| Intermediate / Cross-Attention | Audio & text projected; multi-head attention exchanges contextual cues before pooling. | Custom cross-attention head (OpenSMILE ↔ LaBSE) |
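Late fusion operates purely at the decision level, so it only needs the two classifiers' output probabilities. A minimal sketch; the equal 0.5 weighting is an assumption:

```python
import numpy as np

def late_fuse(audio_probs, text_probs, w_audio=0.5):
    # Weighted average of per-class probabilities from the independent
    # audio and text classifiers; argmax gives the fused label.
    fused = w_audio * audio_probs + (1.0 - w_audio) * text_probs
    return fused.argmax(axis=-1)

# Two samples x two classes (Non-Hate, Hate); the modalities disagree on sample 2.
audio_probs = np.array([[0.2, 0.8], [0.6, 0.4]])
text_probs = np.array([[0.3, 0.7], [0.2, 0.8]])
print(late_fuse(audio_probs, text_probs))  # -> [1 1]
```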

Model Components

  • Projection Layers: LayerNorm + Linear → common_dim (256–512) + ReLU.
  • Cross-Attention: MultiheadAttention (n_heads=2–8) over the paired modality token sequence (see the PyTorch sketch below).
  • Classifier: Dropout (p=0.3–0.4) + Linear stack → softmax.
  • Optimization: AdamW (lr 1e-5–2e-4), weight decay 0.01, stratified 80/20 splits.
  • Metrics: Accuracy, Macro F1, Confusion Matrix, Class-wise Precision/Recall.
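A PyTorch sketch assembling these components into the cross-attention fusion head; the input sizes (6,373-dim OpenSMILE functionals, 768-dim LaBSE embeddings) and the specific head count / common_dim are assumptions drawn from the ranges above:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, audio_dim=6373, text_dim=768, common_dim=256,
                 n_heads=4, n_classes=2, p_drop=0.3):
        super().__init__()
        # Projection layers: LayerNorm + Linear -> common_dim, then ReLU.
        self.audio_proj = nn.Sequential(
            nn.LayerNorm(audio_dim), nn.Linear(audio_dim, common_dim), nn.ReLU())
        self.text_proj = nn.Sequential(
            nn.LayerNorm(text_dim), nn.Linear(text_dim, common_dim), nn.ReLU())
        # Multi-head attention over the paired modality tokens.
        self.attn = nn.MultiheadAttention(common_dim, n_heads, batch_first=True)
        # Classifier: Dropout + Linear; softmax is folded into the loss.
        self.classifier = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(common_dim, n_classes))

    def forward(self, audio_feats, text_feats):
        # Each modality becomes one token, giving a 2-token sequence per sample,
        # so attention lets each modality exchange cues with the other.
        tokens = torch.stack(
            [self.audio_proj(audio_feats), self.text_proj(text_feats)], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.classifier(attended.mean(dim=1))  # pool tokens, emit logits

model = CrossAttentionFusion()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
logits = model(torch.randn(8, 6373), torch.randn(8, 768))  # batch of 8 samples
```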

📈 Results Summary

Text-Based Model Performance

Speech-Based Model Performance

Multimodal (Early-Fusion) Performance

Multimodal (Late-Fusion) Performance


Key Insights

  • Acoustic prosody (energy, voicing, MFCC dynamics) effectively disambiguates implicit or sarcastic hate where lexical tokens alone are ambiguous.
  • Lightweight classical models on engineered audio features can outperform large pretrained transformers in constrained low-resource settings (data scarcity + overfitting risk).
  • Cross-attention fusion provides richer inter-modality interaction but incurs higher computational overhead; late fusion remains robust when one modality degrades.
  • LaBSE’s multilingual alignment aids transliterated Telugu tokens versus vanilla mBERT, improving binary discrimination.
  • Multimodal gains are modest in balanced settings, suggesting future improvements via temporal alignment (utterance-level segmentation) and noise-robust ASR augmentation.

🧠 Tech Stack

| Category | Tools / Models |
| --- | --- |
| Audio | OpenSMILE · Wav2Vec2 · AST · CLAP |
| Text | LaBSE · XLM-R · mBERT · TF-IDF |
| ML / DL | SVM · XGBoost · Random Forest · PyTorch · Transformers |
| Evaluation | Accuracy · Macro F1 · Confusion Matrix |
| Visualization | Matplotlib · Seaborn |
| Audio Processing | Librosa · Torchaudio |

🙏 Citation

If you use this dataset or code in your research, please cite the original paper:

```bibtex
@inproceedings{kumar2024audio,
  title={Audio Driven Detection of Hate Speech in Telugu: Toward Ethical and Secure CPS},
  author={Kumar M, Santhosh and Ravula P, Sai and Teja M, Prasanna and Surya J, Ajay and V, Mohitha and Lal G, Jyothish},
  year={2024}
}
```
