# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
fast-whisper-mcp-server is a high-performance audio transcription service built on faster-whisper with a dual-server architecture:

- **MCP Server** (`whisper_server.py`): Model Context Protocol interface for LLM integration
- **REST API Server** (`api_server.py`): HTTP REST endpoints with FastAPI
The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker patterns, and comprehensive error handling. GPU is required - there is no CPU fallback.
## Core Commands

### Running Servers

```bash
# MCP Server (for LLM integration via MCP)
./run_mcp_server.sh

# REST API Server (for HTTP clients)
./run_api_server.sh

# Both servers log to mcp.logs and api.logs respectively
```
### Testing

```bash
# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py

# Run async API integration tests
python tests/test_async_api_integration.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py
```
### GPU Management

```bash
# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh

# Check GPU status
nvidia-smi

# Monitor GPU during transcription
watch -n 1 nvidia-smi
```
### Installation

```bash
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt

# For CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```
## Architecture

### Directory Structure

```
src/
├── core/                       # Core business logic
│   ├── transcriber.py          # Main transcription logic with env var defaults
│   ├── model_manager.py        # Whisper model loading/caching (GPU-only)
│   ├── job_queue.py            # Async FIFO job queue with worker thread
│   ├── gpu_health.py           # Real GPU health checks with circuit breaker
│   └── gpu_reset.py            # Automatic GPU driver reset logic
├── servers/                    # Server implementations
│   ├── whisper_server.py       # MCP server (stdio transport)
│   └── api_server.py           # FastAPI REST server
└── utils/                      # Utilities
    ├── startup.py              # Common startup sequence (GPU check, initialization)
    ├── circuit_breaker.py      # Circuit breaker pattern implementation
    ├── audio_processor.py      # Audio file validation
    ├── formatters.py           # Output format handlers (txt, vtt, srt, json)
    ├── input_validation.py     # Input validation utilities
    └── test_audio_generator.py # Generate test audio for health checks
```
### Key Architectural Patterns
**Async Job Queue** (`job_queue.py`, sketched below):
- FIFO queue with background worker thread
- Disk persistence of job metadata to `JOB_METADATA_DIR`
- States: QUEUED → RUNNING → COMPLETED/FAILED
- Jobs include full request params + results
- Thread-safe operations with locks
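A minimal sketch of this pattern, not the actual `job_queue.py` implementation (class and method names are illustrative, and disk persistence is omitted):

```python
import queue
import threading
import uuid
from enum import Enum

class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class MiniJobQueue:
    """Toy FIFO job queue: one background worker thread, lock-guarded job dict."""

    def __init__(self, process_fn):
        self._process_fn = process_fn   # callable(params) -> result
        self._queue = queue.Queue()     # thread-safe FIFO of job_ids
        self._jobs = {}                 # job_id -> metadata dict
        self._lock = threading.Lock()   # guards self._jobs
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, params: dict) -> str:
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"state": JobState.QUEUED, "params": params, "result": None}
        self._queue.put(job_id)
        return job_id

    def status(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])

    def _worker(self):
        while True:
            job_id = self._queue.get()  # blocks until a job is available
            with self._lock:
                self._jobs[job_id]["state"] = JobState.RUNNING
                params = self._jobs[job_id]["params"]
            try:
                result = self._process_fn(params)
                state = JobState.COMPLETED
            except Exception as exc:    # job failures are recorded, not raised
                result, state = str(exc), JobState.FAILED
            with self._lock:
                self._jobs[job_id].update(state=state, result=result)
```

Submitting returns a `job_id` immediately; `status(job_id)` then reports the state until it reaches `completed` or `failed`.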
**GPU Health Monitoring** (`gpu_health.py`, sketched below):
- Performs real GPU checks: loads the tiny model and transcribes test audio
- Circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
- Integrates with GPU auto-reset on failures
- Background monitoring thread in the `HealthMonitor` class
- Never falls back to CPU; raises `RuntimeError` if the GPU is unavailable
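A hedged sketch of the check's shape (the real `gpu_health.py` may differ; the test-audio path here is an assumption):

```python
import torch
from faster_whisper import WhisperModel

def gpu_health_check(test_audio: str = "/tmp/health_check.wav") -> None:
    """Load the tiny model on the GPU and transcribe a short clip; raise on failure."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA not available; this service has no CPU fallback")
    model = WhisperModel("tiny", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(test_audio)
    list(segments)  # transcription is lazy; consume it so work actually runs on the GPU
    if torch.cuda.memory_allocated() == 0:
        raise RuntimeError("health check ran but no GPU memory was allocated")
```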
**GPU Auto-Reset** (`gpu_reset.py`, sketched below):
- Automatically resets GPU drivers via `reset_gpu.sh` when health checks fail
- Cooldown mechanism (default 5 min via `GPU_RESET_COOLDOWN_MINUTES`)
- Sudo required: the script unloads/reloads the nvidia kernel modules
- Integrated with circuit breaker to avoid reset loops
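The cooldown logic might look roughly like this (an illustrative sketch, not the real `gpu_reset.py`):

```python
import os
import subprocess
import time

_COOLDOWN_S = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5")) * 60
_last_reset: float | None = None

def maybe_reset_gpu() -> bool:
    """Run reset_gpu.sh at most once per cooldown window; return True if a reset ran."""
    global _last_reset
    now = time.monotonic()
    if _last_reset is not None and now - _last_reset < _COOLDOWN_S:
        return False  # still cooling down; skip to avoid reset loops
    _last_reset = now
    # reset_gpu.sh unloads/reloads the nvidia kernel modules, so it needs sudo
    subprocess.run(["sudo", "./reset_gpu.sh"], check=True)
    return True
```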
**Startup Sequence** (`startup.py`, sketched below):
- Common startup logic for both servers
- Phase 1: GPU health check with optional auto-reset
- Phase 2: Initialize job queue
- Phase 3: Initialize health monitor (background thread)
- Exits on GPU failure unless configured otherwise
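As a rough sketch of the phase ordering, with the concrete wiring passed in as callables since the real `startup.py` signatures are not shown here:

```python
import sys
from typing import Callable

def startup(check_gpu: Callable[[], None],      # raises RuntimeError on GPU failure
            try_reset_gpu: Callable[[], bool],  # returns True if a reset was attempted
            start_job_queue: Callable[[], None],
            start_health_monitor: Callable[[], None],
            exit_on_gpu_failure: bool = True) -> None:
    # Phase 1: GPU health check with optional auto-reset
    try:
        check_gpu()
    except RuntimeError:
        if try_reset_gpu():
            check_gpu()  # re-check once after the driver reset
        elif exit_on_gpu_failure:
            sys.exit("GPU health check failed and auto-reset was not possible")
    # Phase 2: initialize the job queue (background worker thread)
    start_job_queue()
    # Phase 3: initialize the health monitor (background daemon thread)
    start_health_monitor()
```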
**Circuit Breaker** (`circuit_breaker.py`, sketched below):
- States: CLOSED → OPEN → HALF_OPEN → CLOSED
- Configurable failure/success thresholds
- Prevents cascading failures and resource exhaustion
- Used for GPU health checks and model operations
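A compact sketch of this state machine (the real `circuit_breaker.py` is presumably more configurable; thresholds here mirror the defaults mentioned above):

```python
import time

class CircuitBreaker:
    """Minimal CLOSED → OPEN → HALF_OPEN → CLOSED breaker."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1,
                 open_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout_s = open_timeout_s
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at < self.open_timeout_s:
                raise RuntimeError("circuit open: refusing call")
            self.state = "HALF_OPEN"  # timeout elapsed; probe the operation once
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            self._successes = 0
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = time.monotonic()
            raise
        else:
            if self.state == "HALF_OPEN":
                self._successes += 1
                if self._successes >= self.success_threshold:
                    self.state = "CLOSED"
                    self._failures = self._successes = 0
            else:
                self._failures = 0
            return result
```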
## Environment Variables
Both server scripts set extensive environment variables. Key ones are listed below; a sketch of typical defaults follows the lists.
**GPU/CUDA:**
- `CUDA_VISIBLE_DEVICES`: GPU index (default: 1)
- `LD_LIBRARY_PATH`: CUDA library path
- `TRANSCRIPTION_DEVICE`: "cuda" or "auto" (never "cpu")
- `TRANSCRIPTION_COMPUTE_TYPE`: "float16", "int8", or "auto"

**Paths:**
- `WHISPER_MODEL_DIR`: Where Whisper models are cached
- `TRANSCRIPTION_OUTPUT_DIR`: Transcription output directory
- `TRANSCRIPTION_BATCH_OUTPUT_DIR`: Batch output directory
- `JOB_METADATA_DIR`: Job metadata persistence directory

**Transcription Defaults:**
- `TRANSCRIPTION_MODEL`: Model name (default: "large-v3")
- `TRANSCRIPTION_OUTPUT_FORMAT`: "txt", "vtt", "srt", or "json"
- `TRANSCRIPTION_BEAM_SIZE`: Beam search size (default: 5 for API, 2 for MCP)
- `TRANSCRIPTION_TEMPERATURE`: Sampling temperature (default: 0.0)

**Job Queue:**
- `JOB_QUEUE_MAX_SIZE`: Max queued jobs (default: 100 for MCP, 5 for API)
- `JOB_RETENTION_DAYS`: How long to keep job metadata (default: 7)

**Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED`: Enable background monitoring (default: true)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES`: Check interval (default: 10)
- `GPU_HEALTH_TEST_MODEL`: Model for health checks (default: "tiny")
- `GPU_RESET_COOLDOWN_MINUTES`: Cooldown between reset attempts (default: 5)
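For reference, a module might pick these up with `os.getenv` plus the documented defaults (illustrative only; which variables each module actually reads, and the output-format default, are assumptions):

```python
import os

MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")              # "cuda" or "auto", never "cpu"
COMPUTE_TYPE = os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto")  # "float16", "int8", or "auto"
OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt") # default here is an assumption
BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
MODEL_DIR = os.getenv("WHISPER_MODEL_DIR")                      # None -> faster-whisper default cache
JOB_QUEUE_MAX_SIZE = int(os.getenv("JOB_QUEUE_MAX_SIZE", "100"))
JOB_RETENTION_DAYS = int(os.getenv("JOB_RETENTION_DAYS", "7"))
```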
## API Workflow (Async Jobs)
Both MCP and REST API use the same async workflow (a client-side sketch follows):

- **Submit job**: `transcribe_async()` returns a `job_id` immediately
- **Poll status**: `get_job_status(job_id)` returns status + queue_position
- **Get result**: when status="completed", `get_job_result(job_id)` returns the transcription
The job queue processes one job at a time in a background worker thread.
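A generic client-side polling loop over these three calls might look like this (the `call` helper and the response field names are assumptions; adapt it to the MCP tool interface or the REST endpoints):

```python
import time

def transcribe_and_wait(call, audio_path: str, poll_interval_s: float = 2.0):
    """call(tool_name, **kwargs) is a placeholder for however you invoke the server."""
    job_id = call("transcribe_async", audio_path=audio_path)["job_id"]
    while True:
        status = call("get_job_status", job_id=job_id)
        if status["status"] == "completed":
            return call("get_job_result", job_id=job_id)
        if status["status"] == "failed":
            raise RuntimeError(f"job {job_id} failed: {status}")
        # Still queued or running; queue_position indicates where the job sits in the FIFO
        time.sleep(poll_interval_s)
```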
## Model Loading Strategy
- Models are cached in the `model_instances` dict (key: model_name + device + compute_type); see the sketch after this list
- First load downloads the model to `WHISPER_MODEL_DIR` (or the default cache)
- GPU health check on model load may trigger auto-reset if the GPU fails
- No CPU fallback: raises `RuntimeError` if CUDA is unavailable
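In sketch form, assuming faster-whisper's `WhisperModel` and its `download_root` argument (the real `model_manager.py` may structure this differently):

```python
import os
import torch
from faster_whisper import WhisperModel

model_instances: dict[tuple[str, str, str], WhisperModel] = {}

def get_model(model_name: str = "large-v3", device: str = "auto",
              compute_type: str = "auto") -> WhisperModel:
    if device == "auto":
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA unavailable; this service has no CPU fallback")
        device = "cuda"
    key = (model_name, device, compute_type)
    if key not in model_instances:
        # First use downloads the model into WHISPER_MODEL_DIR (or the default cache)
        model_instances[key] = WhisperModel(
            model_name,
            device=device,
            compute_type=compute_type,
            download_root=os.getenv("WHISPER_MODEL_DIR"),
        )
    return model_instances[key]
```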
## Important Implementation Details
**GPU-Only Architecture:**
- `device="auto"` resolution checks `torch.cuda.is_available()` and raises an error if it is False
- No silent fallback to CPU anywhere in the codebase
- Health checks verify the model actually ran on the GPU (check `torch.cuda.memory_allocated`)
**Thread Safety:**
- `JobQueue` uses `threading.Lock` for job dictionary access
- Worker thread processes jobs from a `queue.Queue` (thread-safe FIFO)
- `HealthMonitor` runs in a separate daemon thread
**Error Handling:**
- Circuit breaker prevents retry storms on GPU failures
- Input validation rejects invalid audio files, model names, languages
- Job errors are captured and stored in job metadata with status=FAILED
**Shutdown Handling:**
- `cleanup_on_shutdown()` waits for the current job to complete
- Stops the health monitor thread
- Saves final job states to disk
## Common Development Tasks
**Adding a new output format:**
- Add a formatter function in `src/utils/formatters.py` (an example sketch follows this list)
- Add a case in `transcribe_audio()` in `src/core/transcriber.py`
- Update API docs and MCP tool descriptions
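For example, a hypothetical TSV formatter might look like this, assuming formatters receive segments exposing `.start`, `.end`, and `.text` (as faster-whisper segments do):

```python
def format_tsv(segments) -> str:
    """Hypothetical new formatter: one start<TAB>end<TAB>text row per segment."""
    rows = ["start\tend\ttext"]
    for seg in segments:
        rows.append(f"{seg.start:.2f}\t{seg.end:.2f}\t{seg.text.strip()}")
    return "\n".join(rows)
```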
**Adjusting GPU health check behavior:**
- Modify circuit breaker params in `src/core/gpu_health.py`
- Adjust the health check interval via environment variables
- Consider cooldown timing in `src/core/gpu_reset.py`
**Testing GPU reset logic:**
- Manually trigger a GPU failure (e.g., occupy all GPU memory; see the sketch after this list)
- Watch logs for circuit breaker state transitions
- Verify the reset attempt with cooldown enforcement
- Check `nvidia-smi` before/after the reset
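One blunt way to occupy GPU memory from a separate Python session (use with care on a shared machine):

```python
import torch

hog = []
try:
    while True:
        # ~2 GiB per float16 tensor of this shape; keep allocating until CUDA runs out
        hog.append(torch.empty(1024, 1024, 1024, dtype=torch.float16, device="cuda"))
except torch.cuda.OutOfMemoryError:
    print(f"Holding ~{torch.cuda.memory_allocated() / 2**30:.1f} GiB of GPU memory")
    input("Press Enter to release it...")
```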
**Debugging job queue issues:**
- Check job metadata files in `JOB_METADATA_DIR` (see the sketch after this list)
- Look for lock contention in logs
- Verify the worker thread is running (check logs for "Job queue worker started")
- Test with `JOB_QUEUE_MAX_SIZE=1` to isolate serialization
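A quick way to eyeball persisted job metadata (the JSON file layout and field names are assumptions; adjust to what you actually find in the directory):

```python
import json
import os
from pathlib import Path

meta_dir = Path(os.environ["JOB_METADATA_DIR"])
for path in sorted(meta_dir.glob("*.json")):
    job = json.loads(path.read_text())
    print(path.name, job.get("status"), job.get("error"))
```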