Generate clean, stylized burned-in captions for your videos using:
- Per-word timestamps
- An ASS subtitle file you can burn into the final video with FFmpeg
- Per-word timing: ideal for “karaoke” style highlights
- Stylable captions: control font, size, outline, margins, pop effect, shadows
- Burn-in ready: simple FFmpeg command to embed captions
- Python 3.9+ (3.10/3.11 recommended)
- FFmpeg installed and added to PATH
- macOS:
brew install ffmpeg - Linux:
sudo apt-get install ffmpeg - Windows: Download FFmpeg builds and add
binto PATH
- macOS:
- Python packages from
requirements.txt
git clone https://github.com/nikhil-reddy05/auto-captions.git
cd auto-captions
# (Optional) create a virtual environment
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
pip install -r requirements.txtPlace your input video in the repo root as
input_video.mp4(or change the path/arguments accordingly).
python caption_generator.pyThis extracts audio, runs transcription, and writes:
temp/word_timestamps.json
The JSON is a flat list like:
[
{ "word": "hello", "start": 0.32, "end": 0.61 },
{ "word": "world", "start": 0.62, "end": 0.95 }
]python json_to_ass.py --json temp/word_timestamps.json --out captions.assCommon flags
--words-per-cap– words per caption block (e.g.3)--font/--fontsize– typography (e.g.Montserrat,72)--outline/--shadow– readability (e.g. outline7, shadow0)--margin-v/--margin-lr– vertical / horizontal margins from frame edge
List all options:
python json_to_ass.py -hffmpeg -y -i input_video.mp4 -vf "ass=captions.ass" -c:v libx264 -crf 18 -preset medium -c:a copy output_with_captions.mp4If your chosen font doesn’t render, point FFmpeg to a fonts folder:
ffmpeg -y -i input_video.mp4 -vf "ass=captions.ass:fontsdir=./fonts:shaping=harfbuzz" -c:v libx264 -crf 18 -preset medium -c:a copy output_with_captions.mp4- Set ASS
PlayResX=1080andPlayResY=1920(vertical frame). - Keep captions above UI bars: try
--margin-varound120–180. - For busy backgrounds, use white text + thick black outline (
--outline 7–9). - If speech is very fast, slightly reduce
--fontsizeor increase--words-per-cap.
- Font not found / wrong font: Install the font system‑wide or use
fontsdiras shown above. Ensure the ASS style uses the exact family name. - Text looks tiny or misplaced: Ensure the ASS header’s
PlayResX/PlayResYmatches your target resolution (e.g., 1080×1920). - Transcription is slow on CPU: Use a smaller Whisper model in
caption_generator.py(e.g.,base/small). - Captions overlap platform UI: Increase
--margin-vor nudge the first word’sstartto0.01.
MIT — see LICENSE
This project was inspired by the Captions repository, which provided foundational ideas for stylized caption rendering and subtitle formatting. Our version expands on these concepts by integrating:
- Whisper-based per-word timestamp extraction
- Dynamic ASS karaoke-style captions with pop-in animations
- Full customization via CLI arguments (fonts, outlines, animation timings)