# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Live Captions is a Dockerized web application that provides real-time speech-to-text captions using OpenAI's Whisper model (via faster-whisper). It captures microphone audio in the browser, streams it to a Flask backend for transcription, and displays captions with customizable styling.
## Commands

### Development

```bash
# Build and run (primary development command)
docker compose up --build

# Run in background
docker compose up -d --build

# View logs
docker compose logs -f

# Stop
docker compose down

# Reset all data (database + cached models)
docker compose down -v
```
### First-time setup

```bash
cp .env.example .env
docker compose up --build
```
## Architecture

```
       Browser                      Docker Container
┌─────────────────────┐         ┌─────────────────────────────┐
│  MediaRecorder API  │         │  Flask + Flask-SocketIO     │
│ (1.5s audio chunks) │ ──────► │  (app.py)                   │
│                     │WebSocket│                             │
│  Caption Display    │ ◄────── │ faster-whisper transcriber  │
│  (word-by-word)     │         │  (transcriber.py)           │
│                     │         │                             │
│  Settings Panel     │ ──────► │ SQLite settings persistence │
│                     │ REST API│  (database.py)              │
└─────────────────────┘         └─────────────────────────────┘
```
### Data Flow

- Browser captures mic audio using MediaRecorder, sends base64-encoded WebM chunks every 1.5s via WebSocket
- Backend converts WebM→WAV using pydub/ffmpeg, transcribes with faster-whisper
- Transcribed text is sent back via the `transcription` WebSocket event
- Frontend animates words appearing one-by-one for a streaming effect
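The server-side half of this pipeline can be sketched as follows. This is a minimal illustration, not the actual code in `app.py`: the function names are hypothetical, and where the real backend uses pydub, this sketch shells out to ffmpeg directly.

```python
import base64
import subprocess
import tempfile


def decode_chunk(payload):
    """Decode one audio_data message into raw WebM bytes.

    payload mirrors the WebSocket event shape: {"audio": <base64>, "format": "webm"}
    """
    return base64.b64decode(payload["audio"])


def webm_to_wav(webm_bytes, wav_path):
    """Convert a WebM chunk to 16 kHz mono WAV for Whisper.

    The real app does this via pydub, which drives ffmpeg under the hood.
    """
    with tempfile.NamedTemporaryFile(suffix=".webm") as f:
        f.write(webm_bytes)
        f.flush()
        subprocess.run(
            ["ffmpeg", "-y", "-i", f.name, "-ar", "16000", "-ac", "1", wav_path],
            check=True,
        )
```

The resulting WAV file is what gets handed to the faster-whisper model for transcription.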
### Key Files

- `app.py`: Flask server with SocketIO WebSocket handlers and REST API for settings
- `transcriber.py`: Whisper model loading and audio transcription (singleton model instance)
- `database.py`: SQLite CRUD for user display preferences
- `static/js/app.js`: Audio capture, WebSocket client, word animation queue
- `static/js/settings.js`: Settings panel UI and persistence
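`transcriber.py` keeps a single model instance for the whole process. A minimal version of that lazy-singleton pattern might look like this; the `loader` parameter and the model arguments are illustrative, not the module's real API:

```python
_model = None


def get_model(loader=None):
    """Return the process-wide Whisper model, loading it on first use.

    Loading the model is expensive (seconds plus hundreds of MB of RAM),
    so it is done once and cached at module level.
    """
    global _model
    if _model is None:
        if loader is None:
            # Hypothetical default; real arguments come from environment variables
            from faster_whisper import WhisperModel
            loader = lambda: WhisperModel("base", device="cpu", compute_type="int8")
        _model = loader()
    return _model
```

Every transcription request then reuses the same cached instance rather than reloading weights per chunk.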
## Configuration

Environment variables in `.env`:

- `WHISPER_MODEL`: Model size (tiny/base/small/medium/large) - affects accuracy vs speed
- `WHISPER_DEVICE`: `cpu` or `cuda`
- `WHISPER_COMPUTE_TYPE`: `int8`/`float16`/`float32`
User display settings stored in SQLite (`data/settings.db`):
- Font family, size, weight, color
- Background color, opacity, border radius, padding
- Max words (controls caption buffer length)
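The max-words setting effectively caps a rolling word buffer. The real logic lives in `static/js/app.js`; this is a Python illustration of the behavior, and `CaptionBuffer` is a hypothetical name:

```python
from collections import deque


class CaptionBuffer:
    """Rolling caption buffer capped at max_words."""

    def __init__(self, max_words=12):
        # deque(maxlen=...) drops the oldest words automatically
        self.words = deque(maxlen=max_words)

    def push(self, text):
        # Each transcription event is split into words and appended
        for word in text.split():
            self.words.append(word)

    def render(self):
        return " ".join(self.words)
```

With `max_words=3`, pushing "one two three four" leaves only the last three words on screen.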
## API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Main UI |
| `/api/health` | GET | Health check |
| `/api/settings` | GET/PUT | Read/update user settings |
| `/api/settings/reset` | POST | Reset to defaults |
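A settings update is a plain JSON object PUT to `/api/settings`. The field names below are assumptions based on the display settings listed above, not the exact schema in `database.py`:

```python
import json

# Hypothetical settings payload; the authoritative field names live in database.py
settings = {
    "font_family": "sans-serif",
    "font_size": 32,
    "font_color": "#ffffff",
    "background_color": "#000000",
    "background_opacity": 0.8,
    "max_words": 12,
}

# Serialized request body for PUT /api/settings
body = json.dumps(settings)
```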
## WebSocket Events

| Event | Direction | Payload |
|---|---|---|
| `audio_data` | client→server | `{audio: base64, format: 'webm'}` |
| `transcription` | server→client | `{text: string}` |
| `settings_updated` | server→client | settings object |
## Volumes

- `./data:/app/data` - SQLite database persistence
- `whisper-models` - Cached Whisper model files (~140MB for base)
## NVIDIA GPU Support
GPU acceleration significantly improves transcription speed. Follow these steps to enable it.
### Prerequisites

- NVIDIA GPU with CUDA support
- NVIDIA driver installed (`nvidia-smi` should work)
- Docker installed
### Install NVIDIA Container Toolkit

```bash
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Configure for GPU

1. Update `.env`:

   ```
   WHISPER_DEVICE=cuda
   WHISPER_COMPUTE_TYPE=float16
   ```

2. Run with GPU support:

   ```bash
   docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
   ```
### GPU Compute Types

| Type | Speed | Memory | Notes |
|---|---|---|---|
| `float16` | Fast | Medium | Recommended for most GPUs |
| `int8_float16` | Faster | Lower | Good balance |
| `float32` | Slower | Higher | Maximum precision |
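`transcriber.py` presumably resolves these variables at startup. A sketch of that resolution, where the fallback defaults (`cpu`, and `int8` on CPU vs `float16` on GPU) are assumptions, not the module's documented behavior:

```python
import os


def whisper_config(env=None):
    """Resolve Whisper device and compute type from environment variables.

    Defaults here are illustrative: int8 is the usual CPU choice,
    float16 the usual CUDA choice.
    """
    if env is None:
        env = os.environ
    device = env.get("WHISPER_DEVICE", "cpu")
    default_compute = "float16" if device == "cuda" else "int8"
    compute = env.get("WHISPER_COMPUTE_TYPE", default_compute)
    return device, compute
```

For example, setting only `WHISPER_DEVICE=cuda` would yield `("cuda", "float16")` under these assumed defaults.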
### Troubleshooting

- "could not select device driver": NVIDIA Container Toolkit not installed or Docker not restarted
- CUDA out of memory: try a smaller model (`WHISPER_MODEL=small` or `tiny`)
- Verify GPU access: `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`