Real-time speech-to-text using OpenAI Whisper (faster-whisper). Features browser audio capture, WebSocket streaming, and customizable display settings. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Live Captions is a Dockerized web application that provides real-time speech-to-text captions using OpenAI's Whisper model (via faster-whisper). It captures microphone audio in the browser, streams it to a Flask backend for transcription, and displays captions with customizable styling.

## Commands

### Development
```bash
# Build and run (primary development command)
docker compose up --build

# Run in background
docker compose up -d --build

# View logs
docker compose logs -f

# Stop
docker compose down

# Reset all data (database + cached models)
docker compose down -v
```

### First-time setup
```bash
cp .env.example .env
docker compose up --build
```

## Architecture

```
Browser                          Docker Container
┌─────────────────────┐          ┌─────────────────────────────┐
│  MediaRecorder API  │          │  Flask + Flask-SocketIO     │
│  (1.5s audio chunks)│ ──────►  │  (app.py)                   │
│                     │ WebSocket│         │                   │
│  Caption Display    │ ◄──────  │  faster-whisper transcriber │
│  (word-by-word)     │          │  (transcriber.py)           │
│                     │          │         │                   │
│  Settings Panel     │ ──────►  │  SQLite settings persistence│
│                     │  REST API│  (database.py)              │
└─────────────────────┘          └─────────────────────────────┘
```

### Data Flow
1. The browser captures microphone audio with MediaRecorder and sends base64-encoded WebM chunks every 1.5s over the WebSocket
2. The backend converts WebM→WAV using pydub/ffmpeg, then transcribes the audio with faster-whisper
3. The transcribed text is sent back via the WebSocket `transcription` event
4. The frontend animates words appearing one by one for a streaming effect

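
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in `transcriber.py`: the function names, the `convert` parameter, and the 16 kHz mono resampling are assumptions made for the example. It relies on pydub's `AudioSegment.from_file`/`export` and on faster-whisper's `transcribe` returning a `(segments, info)` pair.

```python
import io
from typing import Callable, Iterable


def webm_to_wav(webm_bytes: bytes) -> io.BytesIO:
    """Convert WebM bytes to WAV in memory (needs pydub and ffmpeg installed)."""
    from pydub import AudioSegment  # lazy import: only needed for real audio

    segment = AudioSegment.from_file(io.BytesIO(webm_bytes), format="webm")
    # Assumption: resample to 16 kHz mono before handing audio to Whisper.
    segment = segment.set_frame_rate(16000).set_channels(1)
    wav = io.BytesIO()
    segment.export(wav, format="wav")
    wav.seek(0)
    return wav


def join_segments(segments: Iterable) -> str:
    """Join the `.text` of each segment, skipping empty ones."""
    return " ".join(s.text.strip() for s in segments if s.text.strip())


def transcribe_chunk(webm_bytes: bytes, model, convert: Callable = webm_to_wav) -> dict:
    """Convert one chunk, transcribe it, and build a `transcription` payload."""
    audio = convert(webm_bytes)
    segments, _info = model.transcribe(audio)  # faster-whisper yields segments lazily
    return {"text": join_segments(segments)}
```

Passing `convert` as a parameter keeps the ffmpeg dependency out of unit tests, since a stub converter and a stub model are enough to exercise the flow.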
### Key Files
- **app.py**: Flask server with SocketIO WebSocket handlers and REST API for settings
- **transcriber.py**: Whisper model loading and audio transcription (singleton model instance)
- **database.py**: SQLite CRUD for user display preferences
- **static/js/app.js**: Audio capture, WebSocket client, word animation queue
- **static/js/settings.js**: Settings panel UI and persistence

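
The "singleton model instance" noted for `transcriber.py` could look something like the sketch below. It is one possible shape, not the repository's actual code: `get_model`, `_load_whisper_model`, and the fallback values `base`/`cpu`/`int8` are hypothetical, while `WhisperModel(model, device=..., compute_type=...)` is the faster-whisper constructor.

```python
import os
import threading

_model = None
_model_lock = threading.Lock()


def _load_whisper_model():
    """Build a WhisperModel from the environment (fallback values are assumptions)."""
    from faster_whisper import WhisperModel  # lazy import: loading is expensive

    return WhisperModel(
        os.environ.get("WHISPER_MODEL", "base"),
        device=os.environ.get("WHISPER_DEVICE", "cpu"),
        compute_type=os.environ.get("WHISPER_COMPUTE_TYPE", "int8"),
    )


def get_model(loader=_load_whisper_model):
    """Return the shared model, calling `loader` only on first use."""
    global _model
    if _model is None:
        with _model_lock:
            if _model is None:  # re-check once the lock is held
                _model = loader()
    return _model
```

Loading once matters because every 1.5s chunk would otherwise pay the model start-up cost; the lock guards against two concurrent first requests both triggering a load.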
## Configuration

Environment variables in `.env`:
- `WHISPER_MODEL`: Model size (tiny/base/small/medium/large); larger models are more accurate but slower
- `WHISPER_DEVICE`: cpu or cuda
- `WHISPER_COMPUTE_TYPE`: int8/float16/float32

User display settings are stored in SQLite (`data/settings.db`):
- Font family, size, weight, color
- Background color, opacity, border radius, padding
- Max words (controls caption buffer length)

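
To illustrate what "Max words" controls, here is a small sketch of a caption buffer that keeps only the most recent words. The real buffer lives in the frontend JavaScript; this Python version, including the `CaptionBuffer` name and the drop-oldest behavior, is an assumption used only to show the idea.

```python
from collections import deque


class CaptionBuffer:
    """Keep at most `max_words` of the most recent caption words."""

    def __init__(self, max_words: int):
        # A bounded deque silently drops the oldest word when full.
        self._words = deque(maxlen=max_words)

    def add_text(self, text: str) -> str:
        """Append newly transcribed words and return the visible caption."""
        self._words.extend(text.split())
        return " ".join(self._words)


buffer = CaptionBuffer(max_words=3)
buffer.add_text("one two")
print(buffer.add_text("three four"))  # prints "two three four"
```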
## API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | GET | Main UI |
| `/api/health` | GET | Health check |
| `/api/settings` | GET/PUT | Read/update user settings |
| `/api/settings/reset` | POST | Reset to defaults |

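
A client for the endpoints in the table might be sketched with the standard library as below. Several details are assumptions not stated in this file: the base URL `http://localhost:5000`, JSON request and response bodies, and the example key `font_size`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: host and port are not documented here


def health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET request for the health check."""
    return urllib.request.Request(f"{base_url}/api/health", method="GET")


def update_settings_request(settings: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the PUT request that updates user settings (JSON body assumed)."""
    return urllib.request.Request(
        f"{base_url}/api/settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def reset_settings_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that resets settings to defaults."""
    return urllib.request.Request(f"{base_url}/api/settings/reset", data=b"", method="POST")


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `send(update_settings_request({"font_size": 32}))` would issue the PUT; building the request separately from sending it makes the method and URL easy to inspect.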
## WebSocket Events

| Event | Direction | Payload |
|-------|-----------|---------|
| `audio_data` | client→server | `{audio: base64, format: 'webm'}` |
| `transcription` | server→client | `{text: string}` |
| `settings_updated` | server→client | settings object |

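
The payload shapes in the table can be made concrete with a few helpers. This is a sketch only: the helper names and the choice to reject formats other than `webm` are assumptions, not behavior confirmed by `app.py`.

```python
import base64


def build_audio_payload(webm_bytes: bytes) -> dict:
    """Build an `audio_data` payload from raw WebM bytes."""
    return {"audio": base64.b64encode(webm_bytes).decode("ascii"), "format": "webm"}


def parse_audio_payload(payload: dict) -> bytes:
    """Validate an `audio_data` payload and return the decoded audio bytes."""
    if payload.get("format") != "webm":
        # Assumption: only the documented 'webm' format is accepted.
        raise ValueError(f"unsupported audio format: {payload.get('format')!r}")
    return base64.b64decode(payload["audio"], validate=True)


def build_transcription_payload(text: str) -> dict:
    """Build the `transcription` payload sent back to the client."""
    return {"text": text.strip()}
```

Base64 adds roughly a third to the size of each chunk, which is the cost of carrying binary audio inside a JSON-style event payload.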
## Volumes

- `./data:/app/data` - SQLite database persistence
- `whisper-models` - Cached Whisper model files (~140MB for base)

## NVIDIA GPU Support

GPU acceleration significantly improves transcription speed. Follow these steps to enable it.

### Prerequisites

1. NVIDIA GPU with CUDA support
2. NVIDIA driver installed (`nvidia-smi` should work)
3. Docker installed

### Install NVIDIA Container Toolkit

```bash
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

### Configure for GPU

1. Update `.env`:
   ```env
   WHISPER_DEVICE=cuda
   WHISPER_COMPUTE_TYPE=float16
   ```

2. Run with GPU support:
   ```bash
   docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
   ```

### GPU Compute Types

| Type | Speed | Memory | Notes |
|------|-------|--------|-------|
| `float16` | Fast | Medium | Recommended for most GPUs |
| `int8_float16` | Faster | Lower | Good balance |
| `float32` | Slower | Higher | Maximum precision |

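
The table can be read as a simple selection rule, sketched below. The function name, the `prefer_low_memory` flag, and the `int8` fallback for CPU are assumptions for illustration; only the GPU choices come from the table.

```python
def recommend_compute_type(device: str, prefer_low_memory: bool = False) -> str:
    """Suggest a WHISPER_COMPUTE_TYPE value for the given WHISPER_DEVICE."""
    if device == "cuda":
        # From the table: float16 is the general recommendation,
        # int8_float16 trades some precision for lower memory use.
        return "int8_float16" if prefer_low_memory else "float16"
    # Assumption: int8 is a reasonable choice when running on CPU.
    return "int8"


print(recommend_compute_type("cuda"))  # prints "float16"
```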
### Troubleshooting

- **"could not select device driver"**: NVIDIA Container Toolkit not installed or Docker not restarted
- **CUDA out of memory**: Try a smaller model (`WHISPER_MODEL=small` or `tiny`)
- **Verify GPU access**: `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`
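
The checks above could be scripted along these lines. This is a hypothetical helper, not part of the repository: the function names and the substring matching are assumptions, and the hints simply restate the list above.

```python
import shutil

# Error substrings (lowercase) mapped to the hints from the list above.
HINTS = {
    "could not select device driver": (
        "Install the NVIDIA Container Toolkit, then restart Docker."
    ),
    "out of memory": "Try a smaller model, e.g. WHISPER_MODEL=small or tiny.",
}


def diagnose(error_text: str) -> str:
    """Return a hint for a known GPU error message, or a generic fallback."""
    lowered = error_text.lower()
    for needle, hint in HINTS.items():
        if needle in lowered:
            return hint
    return "Unrecognized error; verify GPU access with nvidia-smi in a container."


def host_has_nvidia_smi() -> bool:
    """Check whether `nvidia-smi` is on PATH on the host."""
    return shutil.which("nvidia-smi") is not None
```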