# Live Captions

Real-time speech-to-text captions displayed in a customizable browser window, running entirely locally using OpenAI's Whisper model.

## Features

- **Local Processing**: All transcription happens on your machine; no data is sent to external services
- **Real-time Captions**: Audio is captured and transcribed in small chunks for near-instant feedback
- **Customizable Display**: Adjust font, colors, size, background opacity, and more
- **Recording Support**: Save caption sessions as markdown files
- **GPU Acceleration**: Optional NVIDIA GPU support for faster transcription
- **Docker-based**: Easy deployment with minimal setup

## Quick Start

### Prerequisites

- Docker and Docker Compose installed
- Microphone access in browser

### Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd live-captions
   ```

2. Create your environment file:
   ```bash
   cp .env.example .env
   ```

3. Build and run:
   ```bash
   docker compose up --build
   ```

4. Open http://localhost:5000 in your browser

5. Click "Start" and allow microphone access

## Configuration

### Environment Variables

Edit `.env` to customize:

| Variable | Default | Description |
|----------|---------|-------------|
| `WHISPER_MODEL` | `base` | Model size: `tiny`, `base`, `small`, `medium`, `large` |
| `WHISPER_DEVICE` | `cpu` | Processing device: `cpu` or `cuda` |
| `WHISPER_COMPUTE_TYPE` | `int8` | Precision: `int8`, `float16`, `float32` |
| `PORT` | `5000` | Server port |
| `AUDIO_CHUNK_DURATION` | `3` | Seconds of audio per chunk |

### Model Sizes

| Model | Parameters | Speed | Accuracy | RAM Required |
|-------|------------|-------|----------|--------------|
| `tiny` | 39M | Fastest | Lower | ~1GB |
| `base` | 74M | Fast | Good | ~1GB |
| `small` | 244M | Medium | Better | ~2GB |
| `medium` | 769M | Slower | High | ~5GB |
| `large` | 1550M | Slowest | Highest | ~10GB |

### Display Settings

Access the settings panel in the web UI to customize:

- Font family, size, and weight
- Text and background colors
- Background opacity and border radius
- Maximum words displayed

Settings persist in a local SQLite database.

## Docker Commands

```bash
# Build and run
docker compose up --build

# Run in background
docker compose up -d --build

# View logs
docker compose logs -f

# Stop
docker compose down

# Reset all data (database + cached models)
docker compose down -v
```

## NVIDIA GPU Support

GPU acceleration significantly improves transcription speed (3-10x faster than CPU).

### Prerequisites

1. NVIDIA GPU with CUDA support
2. NVIDIA driver installed (verify with `nvidia-smi`)
3. Docker installed

### Install NVIDIA Container Toolkit

```bash
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

### Enable GPU Mode

1. Update `.env`:
   ```env
   WHISPER_DEVICE=cuda
   WHISPER_COMPUTE_TYPE=float16
   ```

2. Run with GPU compose file:
   ```bash
   docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
   ```

### GPU Compute Types

| Type | Speed | Memory | Notes |
|------|-------|--------|-------|
| `float16` | Fast | Medium | Recommended for most GPUs |
| `int8_float16` | Faster | Lower | Good balance of speed/memory |
| `float32` | Slower | Higher | Maximum precision |

### GPU Troubleshooting

- **"could not select device driver"**: NVIDIA Container Toolkit not installed or Docker not restarted
- **CUDA out of memory**: Try a smaller model (`WHISPER_MODEL=small` or `tiny`)
- **Verify GPU access**:
  ```bash
  docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  ```

## Architecture

```
Browser                          Docker Container
┌─────────────────────┐          ┌─────────────────────────────┐
│  MediaRecorder API  │          │  Flask + Flask-SocketIO     │
│  (audio chunks)     │ ──────►  │  (app.py)                   │
│                     │ WebSocket│                             │
│  Caption Display    │ ◄──────  │  faster-whisper transcriber │
│  (word-by-word)     │          │  (transcriber.py)           │
│                     │          │                             │
│  Settings Panel     │ ──────►  │  SQLite settings persistence│
│                     │  REST API│  (database.py)              │
└─────────────────────┘          └─────────────────────────────┘
```

### Data Flow

1. Browser captures microphone audio using the MediaRecorder API
2. Audio is sent as base64-encoded WebM chunks via WebSocket
3. Backend converts WebM to WAV using pydub/ffmpeg
4. faster-whisper transcribes audio to text
5. Text is sent back via WebSocket
6. Frontend displays words with an animation effect

## API Reference

### REST Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Main UI |
| `/api/health` | GET | Health check |
| `/api/settings` | GET | Get current settings |
| `/api/settings` | PUT | Update settings |
| `/api/settings/reset` | POST | Reset to defaults |
| `/api/recordings` | GET | List saved recordings |
| `/api/recordings/` | GET | Get recording content |
| `/api/recordings/` | DELETE | Delete recording |

### WebSocket Events

| Event | Direction | Payload |
|-------|-----------|---------|
| `audio_data` | client → server | `{audio: base64, format: 'webm'}` |
| `transcription` | server → client | `{text: string}` |
| `settings_updated` | server → client | settings object |
| `start_recording` | client → server | - |
| `stop_recording` | client → server | - |

## Data Persistence

| Location | Content |
|----------|---------|
| `./data/` | SQLite database for settings |
| `./recordings/` | Saved caption sessions (markdown) |
| `whisper-models` volume | Cached Whisper model files |

## License

MIT
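
## Example: `audio_data` Payload

The `audio_data` event listed under WebSocket Events carries `{audio: base64, format: 'webm'}`. The sketch below shows one way a client could build that payload and a handler could decode it. It is an illustration only, not the project's code: the helper names `build_audio_payload` and `decode_audio_payload` are hypothetical, and it assumes the `audio` field holds standard base64 of the raw WebM bytes.

```python
import base64
import json


def build_audio_payload(webm_bytes: bytes) -> dict:
    """Wrap raw WebM bytes in the {audio, format} shape from the events table."""
    return {
        "audio": base64.b64encode(webm_bytes).decode("ascii"),
        "format": "webm",
    }


def decode_audio_payload(payload: dict) -> bytes:
    """Recover the raw audio bytes (hypothetical server-side counterpart)."""
    if payload.get("format") != "webm":
        raise ValueError(f"unsupported format: {payload.get('format')!r}")
    # validate=True rejects characters outside the base64 alphabet
    return base64.b64decode(payload["audio"], validate=True)


if __name__ == "__main__":
    chunk = b"\x1a\x45\xdf\xa3 placeholder bytes"  # stand-in data, not real audio
    payload = build_audio_payload(chunk)
    print(json.dumps(payload))
    assert decode_audio_payload(payload) == chunk
```

A Socket.IO client would then emit this dictionary under the `audio_data` event name and listen for `transcription` events carrying `{text: string}`.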