CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

fast-whisper-mcp-server is a high-performance audio transcription service built on faster-whisper with a dual-server architecture:

  • MCP Server (whisper_server.py): Model Context Protocol interface for LLM integration
  • REST API Server (api_server.py): HTTP REST endpoints with FastAPI

The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker protection, and comprehensive error handling. A GPU is required; there is no CPU fallback.

Core Commands

Running Servers

# MCP Server (for LLM integration via MCP)
./run_mcp_server.sh

# REST API Server (for HTTP clients)
./run_api_server.sh

# Both servers log to mcp.logs and api.logs respectively

Testing

# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py

# Run async API integration tests
python tests/test_async_api_integration.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py

GPU Management

# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh

# Check GPU status
nvidia-smi

# Monitor GPU during transcription
watch -n 1 nvidia-smi

Installation

# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt

# For CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Architecture

Directory Structure

src/
├── core/              # Core business logic
│   ├── transcriber.py     # Main transcription logic with env var defaults
│   ├── model_manager.py   # Whisper model loading/caching (GPU-only)
│   ├── job_queue.py       # Async FIFO job queue with worker thread
│   ├── gpu_health.py      # Real GPU health checks with circuit breaker
│   └── gpu_reset.py       # Automatic GPU driver reset logic
├── servers/           # Server implementations
│   ├── whisper_server.py  # MCP server (stdio transport)
│   └── api_server.py      # FastAPI REST server
└── utils/             # Utilities
    ├── startup.py         # Common startup sequence (GPU check, initialization)
    ├── circuit_breaker.py # Circuit breaker pattern implementation
    ├── audio_processor.py # Audio file validation
    ├── formatters.py      # Output format handlers (txt, vtt, srt, json)
    ├── input_validation.py # Input validation utilities
    └── test_audio_generator.py  # Generate test audio for health checks

Key Architectural Patterns

Async Job Queue (job_queue.py):

  • FIFO queue with background worker thread
  • Disk persistence of job metadata to JOB_METADATA_DIR
  • States: QUEUED → RUNNING → COMPLETED/FAILED
  • Jobs include full request params + results
  • Thread-safe operations with locks
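
A minimal sketch of this pattern, assuming the transcription handler is injected as a callable (the real job_queue.py adds disk persistence, queue position tracking, and richer metadata):

import queue
import threading
import uuid
from enum import Enum

class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class SimpleJobQueue:
    def __init__(self, handler):
        self._handler = handler              # callable(params) -> result
        self._jobs = {}                      # job_id -> metadata dict
        self._lock = threading.Lock()        # guards access to self._jobs
        self._queue = queue.Queue()          # thread-safe FIFO of job_ids
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, params) -> str:
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"state": JobState.QUEUED, "params": params}
        self._queue.put(job_id)
        return job_id

    def status(self, job_id) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])

    def _worker(self):
        while True:
            job_id = self._queue.get()       # blocks until a job is queued
            with self._lock:
                self._jobs[job_id]["state"] = JobState.RUNNING
                params = self._jobs[job_id]["params"]
            try:
                result = self._handler(params)
                update = {"state": JobState.COMPLETED, "result": result}
            except Exception as exc:
                update = {"state": JobState.FAILED, "error": str(exc)}
            with self._lock:
                self._jobs[job_id].update(update)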

GPU Health Monitoring (gpu_health.py):

  • Performs real GPU checks: loads tiny model + transcribes test audio
  • Circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
  • Integration with GPU auto-reset on failures
  • Background monitoring thread in HealthMonitor class
  • Never falls back to CPU - raises RuntimeError if GPU unavailable
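
Conceptually the check is a miniature transcription run; a rough sketch using standard faster-whisper calls (function name is illustrative; the real version in gpu_health.py is wrapped by the circuit breaker and can trigger auto-reset):

import torch
from faster_whisper import WhisperModel

def check_gpu_health(test_audio_path: str, model_name: str = "tiny") -> None:
    # Fail fast if CUDA is not visible at all; there is no CPU fallback.
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available; GPU is required")

    # Load the tiny model on the GPU and transcribe a short generated clip.
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(test_audio_path)
    list(segments)  # transcription is lazy; force it to actually run

    # Confirm the work really happened on the GPU.
    if torch.cuda.memory_allocated() == 0:
        raise RuntimeError("Model did not allocate GPU memory")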

GPU Auto-Reset (gpu_reset.py):

  • Automatically resets GPU drivers via reset_gpu.sh when health checks fail
  • Cooldown mechanism (default 5 min via GPU_RESET_COOLDOWN_MINUTES)
  • Sudo required - script unloads/reloads nvidia kernel modules
  • Integrated with circuit breaker to avoid reset loops
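
The cooldown gate is essentially a timestamp check; a minimal sketch with assumed names (the actual logic lives in gpu_reset.py):

import subprocess
import time

COOLDOWN_SECONDS = 5 * 60          # default GPU_RESET_COOLDOWN_MINUTES
_last_reset: float | None = None

def maybe_reset_gpu() -> bool:
    """Run reset_gpu.sh unless a reset already happened within the cooldown window."""
    global _last_reset
    now = time.monotonic()
    if _last_reset is not None and now - _last_reset < COOLDOWN_SECONDS:
        return False                # still cooling down; avoid reset loops
    _last_reset = now
    subprocess.run(["sudo", "./reset_gpu.sh"], check=True)
    return True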

Startup Sequence (startup.py):

  • Common startup logic for both servers
  • Phase 1: GPU health check with optional auto-reset
  • Phase 2: Initialize job queue
  • Phase 3: Initialize health monitor (background thread)
  • Exits on GPU failure unless configured otherwise

Circuit Breaker (circuit_breaker.py):

  • States: CLOSED → OPEN → HALF_OPEN → CLOSED
  • Configurable failure/success thresholds
  • Prevents cascading failures and resource exhaustion
  • Used for GPU health checks and model operations
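
The core state machine, as a minimal sketch (thresholds and names are illustrative; see circuit_breaker.py for the real implementation):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, success_threshold=1, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at < self.reset_timeout:
                raise RuntimeError("circuit open; refusing call")
            self.state = "HALF_OPEN"         # timeout elapsed, allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = time.monotonic()
            raise
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state, self._failures, self._successes = "CLOSED", 0, 0
        else:
            self._failures = 0
        return result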

Environment Variables

Both server scripts set extensive environment variables. Key ones:

GPU/CUDA:

  • CUDA_VISIBLE_DEVICES: GPU index (default: 1)
  • LD_LIBRARY_PATH: CUDA library path
  • TRANSCRIPTION_DEVICE: "cuda" or "auto" (never "cpu")
  • TRANSCRIPTION_COMPUTE_TYPE: "float16", "int8", or "auto"

Paths:

  • WHISPER_MODEL_DIR: Where Whisper models are cached
  • TRANSCRIPTION_OUTPUT_DIR: Transcription output directory
  • TRANSCRIPTION_BATCH_OUTPUT_DIR: Batch output directory
  • JOB_METADATA_DIR: Job metadata persistence directory

Transcription Defaults:

  • TRANSCRIPTION_MODEL: Model name (default: "large-v3")
  • TRANSCRIPTION_OUTPUT_FORMAT: "txt", "vtt", "srt", or "json"
  • TRANSCRIPTION_BEAM_SIZE: Beam search size (default: 5 for API, 2 for MCP)
  • TRANSCRIPTION_TEMPERATURE: Sampling temperature (default: 0.0)

Job Queue:

  • JOB_QUEUE_MAX_SIZE: Max queued jobs (default: 100 for MCP, 5 for API)
  • JOB_RETENTION_DAYS: How long to keep job metadata (default: 7)

Health Monitoring:

  • GPU_HEALTH_CHECK_ENABLED: Enable background monitoring (default: true)
  • GPU_HEALTH_CHECK_INTERVAL_MINUTES: Check interval (default: 10)
  • GPU_HEALTH_TEST_MODEL: Model for health checks (default: "tiny")
  • GPU_RESET_COOLDOWN_MINUTES: Cooldown between reset attempts (default: 5)
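
These values are read as plain environment lookups with defaults; an illustrative example (variable names on the left are arbitrary):

import os

MODEL_NAME = os.environ.get("TRANSCRIPTION_MODEL", "large-v3")
DEVICE = os.environ.get("TRANSCRIPTION_DEVICE", "cuda")              # "cuda" or "auto"
COMPUTE_TYPE = os.environ.get("TRANSCRIPTION_COMPUTE_TYPE", "float16")
BEAM_SIZE = int(os.environ.get("TRANSCRIPTION_BEAM_SIZE", "5"))
HEALTH_CHECK_INTERVAL = int(os.environ.get("GPU_HEALTH_CHECK_INTERVAL_MINUTES", "10"))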

API Workflow (Async Jobs)

Both the MCP server and the REST API use the same async workflow:

  1. Submit job: transcribe_async() returns job_id immediately
  2. Poll status: get_job_status(job_id) returns status + queue_position
  3. Get result: When status="completed", get_job_result(job_id) returns transcription

The job queue processes one job at a time in a background worker thread.
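
A sketch of the polling loop from an HTTP client's perspective. The host, port, endpoint paths, and response fields below are placeholders, not the actual routes; check api_server.py for the real ones:

import time
import requests

BASE = "http://localhost:8000"   # placeholder host/port; see run_api_server.sh

# 1. Submit the job (endpoint path is a placeholder).
resp = requests.post(f"{BASE}/transcribe_async", json={"audio_path": "/data/example.mp3"})
job_id = resp.json()["job_id"]

# 2. Poll until the background worker finishes the job.
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}/status").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)   # one job runs at a time, so generous polling intervals are fine

# 3. Fetch the transcription once the job has completed.
if status["status"] == "completed":
    result = requests.get(f"{BASE}/jobs/{job_id}/result").json()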

Model Loading Strategy

  • Models are cached in model_instances dict (key: model_name + device + compute_type)
  • First load downloads model to WHISPER_MODEL_DIR (or default cache)
  • GPU health check on model load - may trigger auto-reset if GPU fails
  • No CPU fallback - raises RuntimeError if CUDA unavailable
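
The caching approach, sketched with an assumed get_model() helper (the real logic is in model_manager.py; the cache key mirrors the description above):

import os
from faster_whisper import WhisperModel

model_instances: dict[str, WhisperModel] = {}

def get_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}-{device}-{compute_type}"
    if key not in model_instances:
        # First use downloads the model into WHISPER_MODEL_DIR (or the default cache).
        model_instances[key] = WhisperModel(
            model_name,
            device=device,
            compute_type=compute_type,
            download_root=os.environ.get("WHISPER_MODEL_DIR"),
        )
    return model_instances[key]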

Important Implementation Details

GPU-Only Architecture:

  • Resolving device="auto" checks torch.cuda.is_available() and raises an error if it returns False
  • No silent fallback to CPU anywhere in the codebase
  • Health checks verify model actually ran on GPU (check torch.cuda.memory_allocated)
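
A minimal sketch of that resolution rule (function name assumed for illustration):

import torch

def resolve_device(requested: str) -> str:
    """Resolve "auto"/"cuda" to a concrete device, refusing to fall back to CPU."""
    if requested == "cpu":
        raise ValueError("CPU transcription is not supported by this service")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available; GPU is required")
    return "cuda"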

Thread Safety:

  • JobQueue uses threading.Lock for job dictionary access
  • Worker thread processes jobs from queue.Queue (thread-safe FIFO)
  • HealthMonitor runs in separate daemon thread

Error Handling:

  • Circuit breaker prevents retry storms on GPU failures
  • Input validation rejects invalid audio files, model names, languages
  • Job errors are captured and stored in job metadata with status=FAILED

Shutdown Handling:

  • cleanup_on_shutdown() waits for current job to complete
  • Stops health monitor thread
  • Saves final job states to disk

Common Development Tasks

Adding a new output format:

  1. Add formatter function in src/utils/formatters.py
  2. Add case in transcribe_audio() in src/core/transcriber.py
  3. Update API docs and MCP tool descriptions
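
For step 1, a formatter is just a function that maps segments to a string. A hypothetical TSV formatter following the shape of faster-whisper segments (the exact segment representation used by formatters.py may differ):

def format_as_tsv(segments) -> str:
    """Render segments as tab-separated start/end/text lines (illustrative only)."""
    lines = ["start\tend\ttext"]
    for seg in segments:
        lines.append(f"{seg.start:.2f}\t{seg.end:.2f}\t{seg.text.strip()}")
    return "\n".join(lines)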

Adjusting GPU health check behavior:

  1. Modify circuit breaker params in src/core/gpu_health.py
  2. Adjust health check interval in environment variables
  3. Consider cooldown timing in src/core/gpu_reset.py

Testing GPU reset logic:

  1. Manually trigger GPU failure (e.g., occupy all GPU memory)
  2. Watch logs for circuit breaker state transitions
  3. Verify reset attempt with cooldown enforcement
  4. Check nvidia-smi before/after reset
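
For step 1, one way to simulate a failure is to hold most of the free VRAM from a separate Python process while a health check runs. A rough sketch; run it with the same CUDA_VISIBLE_DEVICES as the service (default index 1):

import torch

free_bytes, _total = torch.cuda.mem_get_info()
# Hold ~90% of free VRAM (float16 = 2 bytes per element) so the service's next
# model load or health check fails with an out-of-memory error.
hog = torch.empty(int(free_bytes * 0.9) // 2, dtype=torch.float16, device="cuda")
input("GPU memory held; press Enter to release and exit...")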

Debugging job queue issues:

  1. Check job metadata files in JOB_METADATA_DIR
  2. Look for lock contention in logs
  3. Verify worker thread is running (check logs for "Job queue worker started")
  4. Test with JOB_QUEUE_MAX_SIZE=1 to isolate serialization
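
For step 1, the metadata can be inspected directly on disk. The sketch below assumes one JSON document per job, which may not match how job_queue.py actually lays out the files; adjust accordingly:

import json
import os
import pathlib

meta_dir = pathlib.Path(os.environ.get("JOB_METADATA_DIR", "."))
for path in sorted(meta_dir.glob("*.json")):
    job = json.loads(path.read_text())
    print(path.name, job.get("status"), job.get("error", ""))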