# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Live Captions is a Dockerized web application that provides real-time speech-to-text captions using OpenAI's Whisper model (via faster-whisper). It captures microphone audio in the browser, streams it to a Flask backend for transcription, and displays captions with customizable styling.
## Commands

### Development

```bash
# Build and run (primary development command)
docker compose up --build

# Run in background
docker compose up -d --build

# View logs
docker compose logs -f

# Stop
docker compose down

# Reset all data (database + cached models)
docker compose down -v
```
### First-time setup

```bash
cp .env.example .env
docker compose up --build
```
## Architecture

```
       Browser                      Docker Container
┌─────────────────────┐         ┌─────────────────────────────┐
│  MediaRecorder API  │         │  Flask + Flask-SocketIO     │
│ (1.5s audio chunks) │ ──────► │  (app.py)                   │
│                     │WebSocket│                             │
│  Caption Display    │ ◄────── │ faster-whisper transcriber  │
│  (word-by-word)     │         │  (transcriber.py)           │
│                     │         │                             │
│  Settings Panel     │ ──────► │ SQLite settings persistence │
│                     │ REST API│  (database.py)              │
└─────────────────────┘         └─────────────────────────────┘
```
### Data Flow

- Browser captures mic audio using MediaRecorder, sends base64-encoded WebM chunks every 1.5s via WebSocket
- Backend converts WebM→WAV using pydub/ffmpeg, transcribes with faster-whisper
- Transcribed text is sent back via the `transcription` WebSocket event
- Frontend animates words appearing one-by-one for a streaming effect
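The server-side half of this pipeline can be sketched as follows. This is a minimal illustration, not the actual code in `app.py`: the function names are hypothetical, and where the real backend uses pydub, this sketch shells out to ffmpeg directly.

```python
import base64
import subprocess
import tempfile


def decode_chunk(payload):
    """Decode one audio_data message into raw WebM bytes.

    payload mirrors the WebSocket event shape: {"audio": <base64>, "format": "webm"}
    """
    return base64.b64decode(payload["audio"])


def webm_to_wav(webm_bytes, wav_path):
    """Convert a WebM chunk to 16 kHz mono WAV for Whisper.

    The real app does this via pydub, which drives ffmpeg under the hood.
    """
    with tempfile.NamedTemporaryFile(suffix=".webm") as f:
        f.write(webm_bytes)
        f.flush()
        subprocess.run(
            ["ffmpeg", "-y", "-i", f.name, "-ar", "16000", "-ac", "1", wav_path],
            check=True,
        )
```

The resulting WAV file is what gets handed to the faster-whisper model for transcription.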
### Key Files

- `app.py`: Flask server with SocketIO WebSocket handlers and REST API for settings
- `transcriber.py`: Whisper model loading and audio transcription (singleton model instance)
- `database.py`: SQLite CRUD for user display preferences
- `static/js/app.js`: Audio capture, WebSocket client, word animation queue
- `static/js/settings.js`: Settings panel UI and persistence
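`transcriber.py` keeps a single model instance for the whole process. A minimal version of that lazy-singleton pattern might look like this; the `loader` parameter and the model arguments are illustrative, not the module's real API:

```python
_model = None


def get_model(loader=None):
    """Return the process-wide Whisper model, loading it on first use.

    Loading the model is expensive (seconds plus hundreds of MB of RAM),
    so it is done once and cached at module level.
    """
    global _model
    if _model is None:
        if loader is None:
            # Hypothetical default; real arguments come from environment variables
            from faster_whisper import WhisperModel
            loader = lambda: WhisperModel("base", device="cpu", compute_type="int8")
        _model = loader()
    return _model
```

Every transcription request then reuses the same cached instance rather than reloading weights per chunk.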
## Configuration

Environment variables in `.env`:

- `WHISPER_MODEL`: Model size (tiny/base/small/medium/large) - affects accuracy vs speed
- `WHISPER_DEVICE`: `cpu` or `cuda`
- `WHISPER_COMPUTE_TYPE`: `int8`/`float16`/`float32`
User display settings stored in SQLite (`data/settings.db`):
- Font family, size, weight, color
- Background color, opacity, border radius, padding
- Max words (controls caption buffer length)
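The max-words setting effectively caps a rolling word buffer. The real logic lives in `static/js/app.js`; this is a Python illustration of the behavior, and `CaptionBuffer` is a hypothetical name:

```python
from collections import deque


class CaptionBuffer:
    """Rolling caption buffer capped at max_words."""

    def __init__(self, max_words=12):
        # deque(maxlen=...) drops the oldest words automatically
        self.words = deque(maxlen=max_words)

    def push(self, text):
        # Each transcription event is split into words and appended
        for word in text.split():
            self.words.append(word)

    def render(self):
        return " ".join(self.words)
```

With `max_words=3`, pushing "one two three four" leaves only the last three words on screen.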
## API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Main UI |
| `/api/health` | GET | Health check |
| `/api/settings` | GET/PUT | Read/update user settings |
| `/api/settings/reset` | POST | Reset to defaults |
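A settings update is a plain JSON object PUT to `/api/settings`. The field names below are assumptions based on the display settings listed above, not the exact schema in `database.py`:

```python
import json

# Hypothetical settings payload; the authoritative field names live in database.py
settings = {
    "font_family": "sans-serif",
    "font_size": 32,
    "font_color": "#ffffff",
    "background_color": "#000000",
    "background_opacity": 0.8,
    "max_words": 12,
}

# Serialized request body for PUT /api/settings
body = json.dumps(settings)
```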
## WebSocket Events

| Event | Direction | Payload |
|---|---|---|
| `audio_data` | client→server | `{audio: base64, format: 'webm'}` |
| `transcription` | server→client | `{text: string}` |
| `settings_updated` | server→client | settings object |
## Volumes

- `./data:/app/data` - SQLite database persistence
- `whisper-models` - Cached Whisper model files (~140MB for base)
## NVIDIA GPU Support
GPU acceleration significantly improves transcription speed. Follow these steps to enable it.
### Prerequisites

- NVIDIA GPU with CUDA support
- NVIDIA driver installed (`nvidia-smi` should work)
- Docker installed
### Install NVIDIA Container Toolkit

```bash
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Configure for GPU

1. Update `.env`:

   ```
   WHISPER_DEVICE=cuda
   WHISPER_COMPUTE_TYPE=float16
   ```

2. Run with GPU support:

   ```bash
   docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
   ```
### GPU Compute Types

| Type | Speed | Memory | Notes |
|---|---|---|---|
| `float16` | Fast | Medium | Recommended for most GPUs |
| `int8_float16` | Faster | Lower | Good balance |
| `float32` | Slower | Higher | Maximum precision |
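`transcriber.py` presumably resolves these variables at startup. A sketch of that resolution, where the fallback defaults (`cpu`, and `int8` on CPU vs `float16` on GPU) are assumptions, not the module's documented behavior:

```python
import os


def whisper_config(env=None):
    """Resolve Whisper device and compute type from environment variables.

    Defaults here are illustrative: int8 is the usual CPU choice,
    float16 the usual CUDA choice.
    """
    if env is None:
        env = os.environ
    device = env.get("WHISPER_DEVICE", "cpu")
    default_compute = "float16" if device == "cuda" else "int8"
    compute = env.get("WHISPER_COMPUTE_TYPE", default_compute)
    return device, compute
```

For example, setting only `WHISPER_DEVICE=cuda` would yield `("cuda", "float16")` under these assumed defaults.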
### Troubleshooting

- "could not select device driver": NVIDIA Container Toolkit not installed or Docker not restarted
- CUDA out of memory: try a smaller model (`WHISPER_MODEL=small` or `tiny`)
- Verify GPU access: `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`