# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
fast-whisper-mcp-server is a high-performance audio transcription service built on faster-whisper with a dual-server architecture:

- **MCP Server** (`whisper_server.py`): Model Context Protocol interface for LLM integration
- **REST API Server** (`api_server.py`): HTTP REST endpoints with FastAPI
The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker patterns, and comprehensive error handling. GPU is required - there is no CPU fallback.
## Core Commands

### Running Servers

```bash
# MCP Server (for LLM integration via MCP)
./run_mcp_server.sh

# REST API Server (for HTTP clients)
./run_api_server.sh

# Both servers log to mcp.logs and api.logs respectively
```
### Testing

```bash
# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py

# Run async API integration tests
python tests/test_async_api_integration.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py
```
### GPU Management

```bash
# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh

# Check GPU status
nvidia-smi

# Monitor GPU during transcription
watch -n 1 nvidia-smi
```
### Installation

```bash
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt

# For CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```
## Architecture

### Directory Structure

```
src/
├── core/                       # Core business logic
│   ├── transcriber.py          # Main transcription logic with env var defaults
│   ├── model_manager.py        # Whisper model loading/caching (GPU-only)
│   ├── job_queue.py            # Async FIFO job queue with worker thread
│   ├── gpu_health.py           # Real GPU health checks with circuit breaker
│   └── gpu_reset.py            # Automatic GPU driver reset logic
├── servers/                    # Server implementations
│   ├── whisper_server.py       # MCP server (stdio transport)
│   └── api_server.py           # FastAPI REST server
└── utils/                      # Utilities
    ├── startup.py              # Common startup sequence (GPU check, initialization)
    ├── circuit_breaker.py      # Circuit breaker pattern implementation
    ├── audio_processor.py      # Audio file validation
    ├── formatters.py           # Output format handlers (txt, vtt, srt, json)
    ├── input_validation.py     # Input validation utilities
    └── test_audio_generator.py # Generate test audio for health checks
```
### Key Architectural Patterns
**Async Job Queue** (`job_queue.py`, sketched below):
- FIFO queue with background worker thread
- Disk persistence of job metadata to `JOB_METADATA_DIR`
- States: QUEUED → RUNNING → COMPLETED/FAILED
- Jobs include full request params + results
- Thread-safe operations with locks
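A minimal sketch of this pattern, not the actual `job_queue.py` implementation (class and method names are illustrative, and disk persistence is omitted):

```python
import queue
import threading
import uuid
from enum import Enum

class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class MiniJobQueue:
    """Toy FIFO job queue: one background worker thread, lock-guarded job dict."""

    def __init__(self, process_fn):
        self._process_fn = process_fn   # callable(params) -> result
        self._queue = queue.Queue()     # thread-safe FIFO of job_ids
        self._jobs = {}                 # job_id -> metadata dict
        self._lock = threading.Lock()   # guards self._jobs
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, params: dict) -> str:
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"state": JobState.QUEUED, "params": params, "result": None}
        self._queue.put(job_id)
        return job_id

    def status(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])

    def _worker(self):
        while True:
            job_id = self._queue.get()  # blocks until a job is available
            with self._lock:
                self._jobs[job_id]["state"] = JobState.RUNNING
                params = self._jobs[job_id]["params"]
            try:
                result = self._process_fn(params)
                state = JobState.COMPLETED
            except Exception as exc:    # job failures are recorded, not raised
                result, state = str(exc), JobState.FAILED
            with self._lock:
                self._jobs[job_id].update(state=state, result=result)
```

Submitting returns a `job_id` immediately; `status(job_id)` then reports the state until it reaches `completed` or `failed`.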
**GPU Health Monitoring** (`gpu_health.py`, sketched below):
- Performs real GPU checks: loads the tiny model and transcribes test audio
- Circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
- Integrates with GPU auto-reset on failures
- Background monitoring thread in the `HealthMonitor` class
- Never falls back to CPU; raises `RuntimeError` if the GPU is unavailable
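A hedged sketch of the check's shape (the real `gpu_health.py` may differ; the test-audio path here is an assumption):

```python
import torch
from faster_whisper import WhisperModel

def gpu_health_check(test_audio: str = "/tmp/health_check.wav") -> None:
    """Load the tiny model on the GPU and transcribe a short clip; raise on failure."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA not available; this service has no CPU fallback")
    model = WhisperModel("tiny", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(test_audio)
    list(segments)  # transcription is lazy; consume it so work actually runs on the GPU
    if torch.cuda.memory_allocated() == 0:
        raise RuntimeError("health check ran but no GPU memory was allocated")
```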
**GPU Auto-Reset** (`gpu_reset.py`, sketched below):
- Automatically resets GPU drivers via `reset_gpu.sh` when health checks fail
- Cooldown mechanism (default 5 min via `GPU_RESET_COOLDOWN_MINUTES`)
- Sudo required: the script unloads/reloads the nvidia kernel modules
- Integrated with circuit breaker to avoid reset loops
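The cooldown logic might look roughly like this (an illustrative sketch, not the real `gpu_reset.py`):

```python
import os
import subprocess
import time

_COOLDOWN_S = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5")) * 60
_last_reset: float | None = None

def maybe_reset_gpu() -> bool:
    """Run reset_gpu.sh at most once per cooldown window; return True if a reset ran."""
    global _last_reset
    now = time.monotonic()
    if _last_reset is not None and now - _last_reset < _COOLDOWN_S:
        return False  # still cooling down; skip to avoid reset loops
    _last_reset = now
    # reset_gpu.sh unloads/reloads the nvidia kernel modules, so it needs sudo
    subprocess.run(["sudo", "./reset_gpu.sh"], check=True)
    return True
```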
**Startup Sequence** (`startup.py`, sketched below):
- Common startup logic for both servers
- Phase 1: GPU health check with optional auto-reset
- Phase 2: Initialize job queue
- Phase 3: Initialize health monitor (background thread)
- Exits on GPU failure unless configured otherwise
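As a rough sketch of the phase ordering, with the concrete wiring passed in as callables since the real `startup.py` signatures are not shown here:

```python
import sys
from typing import Callable

def startup(check_gpu: Callable[[], None],      # raises RuntimeError on GPU failure
            try_reset_gpu: Callable[[], bool],  # returns True if a reset was attempted
            start_job_queue: Callable[[], None],
            start_health_monitor: Callable[[], None],
            exit_on_gpu_failure: bool = True) -> None:
    # Phase 1: GPU health check with optional auto-reset
    try:
        check_gpu()
    except RuntimeError:
        if try_reset_gpu():
            check_gpu()  # re-check once after the driver reset
        elif exit_on_gpu_failure:
            sys.exit("GPU health check failed and auto-reset was not possible")
    # Phase 2: initialize the job queue (background worker thread)
    start_job_queue()
    # Phase 3: initialize the health monitor (background daemon thread)
    start_health_monitor()
```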
**Circuit Breaker** (`circuit_breaker.py`, sketched below):
- States: CLOSED → OPEN → HALF_OPEN → CLOSED
- Configurable failure/success thresholds
- Prevents cascading failures and resource exhaustion
- Used for GPU health checks and model operations
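A compact sketch of this state machine (the real `circuit_breaker.py` is presumably more configurable; thresholds here mirror the defaults mentioned above):

```python
import time

class CircuitBreaker:
    """Minimal CLOSED → OPEN → HALF_OPEN → CLOSED breaker."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1,
                 open_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout_s = open_timeout_s
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at < self.open_timeout_s:
                raise RuntimeError("circuit open: refusing call")
            self.state = "HALF_OPEN"  # timeout elapsed; probe the operation once
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            self._successes = 0
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = time.monotonic()
            raise
        else:
            if self.state == "HALF_OPEN":
                self._successes += 1
                if self._successes >= self.success_threshold:
                    self.state = "CLOSED"
                    self._failures = self._successes = 0
            else:
                self._failures = 0
            return result
```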
## Environment Variables
Both server scripts set extensive environment variables. Key ones are listed below; a sketch of typical defaults follows the lists.
**GPU/CUDA:**
- `CUDA_VISIBLE_DEVICES`: GPU index (default: 1)
- `LD_LIBRARY_PATH`: CUDA library path
- `TRANSCRIPTION_DEVICE`: "cuda" or "auto" (never "cpu")
- `TRANSCRIPTION_COMPUTE_TYPE`: "float16", "int8", or "auto"

**Paths:**
- `WHISPER_MODEL_DIR`: Where Whisper models are cached
- `TRANSCRIPTION_OUTPUT_DIR`: Transcription output directory
- `TRANSCRIPTION_BATCH_OUTPUT_DIR`: Batch output directory
- `JOB_METADATA_DIR`: Job metadata persistence directory

**Transcription Defaults:**
- `TRANSCRIPTION_MODEL`: Model name (default: "large-v3")
- `TRANSCRIPTION_OUTPUT_FORMAT`: "txt", "vtt", "srt", or "json"
- `TRANSCRIPTION_BEAM_SIZE`: Beam search size (default: 5 for API, 2 for MCP)
- `TRANSCRIPTION_TEMPERATURE`: Sampling temperature (default: 0.0)

**Job Queue:**
- `JOB_QUEUE_MAX_SIZE`: Max queued jobs (default: 100 for MCP, 5 for API)
- `JOB_RETENTION_DAYS`: How long to keep job metadata (default: 7)

**Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED`: Enable background monitoring (default: true)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES`: Check interval (default: 10)
- `GPU_HEALTH_TEST_MODEL`: Model for health checks (default: "tiny")
- `GPU_RESET_COOLDOWN_MINUTES`: Cooldown between reset attempts (default: 5)
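For reference, a module might pick these up with `os.getenv` plus the documented defaults (illustrative only; which variables each module actually reads, and the output-format default, are assumptions):

```python
import os

MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")              # "cuda" or "auto", never "cpu"
COMPUTE_TYPE = os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto")  # "float16", "int8", or "auto"
OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt") # default here is an assumption
BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
MODEL_DIR = os.getenv("WHISPER_MODEL_DIR")                      # None -> faster-whisper default cache
JOB_QUEUE_MAX_SIZE = int(os.getenv("JOB_QUEUE_MAX_SIZE", "100"))
JOB_RETENTION_DAYS = int(os.getenv("JOB_RETENTION_DAYS", "7"))
```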
## API Workflow (Async Jobs)
Both MCP and REST API use the same async workflow (a client-side sketch follows):

- **Submit job**: `transcribe_async()` returns a `job_id` immediately
- **Poll status**: `get_job_status(job_id)` returns status + queue_position
- **Get result**: when status="completed", `get_job_result(job_id)` returns the transcription
The job queue processes one job at a time in a background worker thread.
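A generic client-side polling loop over these three calls might look like this (the `call` helper and the response field names are assumptions; adapt it to the MCP tool interface or the REST endpoints):

```python
import time

def transcribe_and_wait(call, audio_path: str, poll_interval_s: float = 2.0):
    """call(tool_name, **kwargs) is a placeholder for however you invoke the server."""
    job_id = call("transcribe_async", audio_path=audio_path)["job_id"]
    while True:
        status = call("get_job_status", job_id=job_id)
        if status["status"] == "completed":
            return call("get_job_result", job_id=job_id)
        if status["status"] == "failed":
            raise RuntimeError(f"job {job_id} failed: {status}")
        # Still queued or running; queue_position indicates where the job sits in the FIFO
        time.sleep(poll_interval_s)
```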
## Model Loading Strategy
- Models are cached in the `model_instances` dict (key: model_name + device + compute_type); see the sketch after this list
- First load downloads the model to `WHISPER_MODEL_DIR` (or the default cache)
- GPU health check on model load may trigger auto-reset if the GPU fails
- No CPU fallback: raises `RuntimeError` if CUDA is unavailable
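In sketch form, assuming faster-whisper's `WhisperModel` and its `download_root` argument (the real `model_manager.py` may structure this differently):

```python
import os
import torch
from faster_whisper import WhisperModel

model_instances: dict[tuple[str, str, str], WhisperModel] = {}

def get_model(model_name: str = "large-v3", device: str = "auto",
              compute_type: str = "auto") -> WhisperModel:
    if device == "auto":
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA unavailable; this service has no CPU fallback")
        device = "cuda"
    key = (model_name, device, compute_type)
    if key not in model_instances:
        # First use downloads the model into WHISPER_MODEL_DIR (or the default cache)
        model_instances[key] = WhisperModel(
            model_name,
            device=device,
            compute_type=compute_type,
            download_root=os.getenv("WHISPER_MODEL_DIR"),
        )
    return model_instances[key]
```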
## Important Implementation Details
**GPU-Only Architecture:**
- `device="auto"` resolution checks `torch.cuda.is_available()` and raises an error if it is False
- No silent fallback to CPU anywhere in the codebase
- Health checks verify the model actually ran on the GPU (check `torch.cuda.memory_allocated`)
**Thread Safety:**
- `JobQueue` uses `threading.Lock` for job dictionary access
- Worker thread processes jobs from a `queue.Queue` (thread-safe FIFO)
- `HealthMonitor` runs in a separate daemon thread
**Error Handling:**
- Circuit breaker prevents retry storms on GPU failures
- Input validation rejects invalid audio files, model names, languages
- Job errors are captured and stored in job metadata with status=FAILED
**Shutdown Handling:**
- `cleanup_on_shutdown()` waits for the current job to complete
- Stops the health monitor thread
- Saves final job states to disk
## Common Development Tasks
**Adding a new output format:**
- Add a formatter function in `src/utils/formatters.py` (an example sketch follows this list)
- Add a case in `transcribe_audio()` in `src/core/transcriber.py`
- Update API docs and MCP tool descriptions
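For example, a hypothetical TSV formatter might look like this, assuming formatters receive segments exposing `.start`, `.end`, and `.text` (as faster-whisper segments do):

```python
def format_tsv(segments) -> str:
    """Hypothetical new formatter: one start<TAB>end<TAB>text row per segment."""
    rows = ["start\tend\ttext"]
    for seg in segments:
        rows.append(f"{seg.start:.2f}\t{seg.end:.2f}\t{seg.text.strip()}")
    return "\n".join(rows)
```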
**Adjusting GPU health check behavior:**
- Modify circuit breaker params in `src/core/gpu_health.py`
- Adjust the health check interval via environment variables
- Consider cooldown timing in `src/core/gpu_reset.py`
**Testing GPU reset logic:**
- Manually trigger a GPU failure (e.g., occupy all GPU memory; see the sketch after this list)
- Watch logs for circuit breaker state transitions
- Verify the reset attempt with cooldown enforcement
- Check `nvidia-smi` before/after the reset
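One blunt way to occupy GPU memory from a separate Python session (use with care on a shared machine):

```python
import torch

hog = []
try:
    while True:
        # ~2 GiB per float16 tensor of this shape; keep allocating until CUDA runs out
        hog.append(torch.empty(1024, 1024, 1024, dtype=torch.float16, device="cuda"))
except torch.cuda.OutOfMemoryError:
    print(f"Holding ~{torch.cuda.memory_allocated() / 2**30:.1f} GiB of GPU memory")
    input("Press Enter to release it...")
```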
**Debugging job queue issues:**
- Check job metadata files in `JOB_METADATA_DIR` (see the sketch after this list)
- Look for lock contention in logs
- Verify the worker thread is running (check logs for "Job queue worker started")
- Test with `JOB_QUEUE_MAX_SIZE=1` to isolate serialization
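A quick way to eyeball persisted job metadata (the JSON file layout and field names are assumptions; adjust to what you actually find in the directory):

```python
import json
import os
from pathlib import Path

meta_dir = Path(os.environ["JOB_METADATA_DIR"])
for path in sorted(meta_dir.glob("*.json")):
    job = json.loads(path.read_text())
    print(path.name, job.get("status"), job.get("error"))
```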