# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:

1. **MCP Server** - For integration with Claude Desktop and other MCP clients
2. **REST API Server** - For HTTP-based integrations with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

**Key Features:**
- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- **Automatic GPU driver reset** on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory

## Development Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```

### Running the Servers

#### MCP Server (for Claude Desktop)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution
python whisper_server.py

# Using MCP CLI for development testing
mcp dev whisper_server.py

# Run server with MCP CLI
mcp run whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn
python api_server.py

# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```
## Architecture

### Directory Structure

```
.
├── src/                            # Source code directory
│   ├── servers/                    # Server implementations
│   │   ├── whisper_server.py       # MCP server entry point
│   │   └── api_server.py           # REST API server (async job queue)
│   ├── core/                       # Core business logic
│   │   ├── transcriber.py          # Transcription logic (single & batch)
│   │   ├── model_manager.py        # Model lifecycle & caching
│   │   ├── job_queue.py            # Async job queue manager
│   │   └── gpu_health.py           # GPU health monitoring
│   └── utils/                      # Utility modules
│       ├── audio_processor.py      # Audio validation & preprocessing
│       ├── formatters.py           # Output format conversion
│       └── test_audio_generator.py # Test audio generation for GPU checks
├── run_mcp_server.sh               # MCP server startup script
├── run_api_server.sh               # API server startup script
├── reset_gpu.sh                    # GPU driver reset script
├── DEV_PLAN.md                     # Development plan for async features
├── requirements.txt                # Python dependencies
└── pyproject.toml                  # Project configuration
```

### Core Components

1. **src/servers/whisper_server.py** - MCP server entry point
   - Uses FastMCP framework to expose MCP tools
   - Three main tools: `get_model_info_api()`, `transcribe()`, `batch_transcribe_audio()`
   - Server initialization at line 19

2. **src/servers/api_server.py** - REST API server entry point
   - Uses FastAPI framework for HTTP endpoints
   - Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
   - Shares core transcription logic with MCP server
   - File upload support via multipart/form-data

3. **src/core/transcriber.py** - Core transcription logic (shared by both servers)
   - `transcribe_audio()`:39 - Single file transcription with environment variable support
   - `batch_transcribe()`:209 - Batch processing with progress reporting
   - All parameters support environment variable defaults (lines 21-37)
   - Delegates output formatting to utils.formatters

4. **src/core/model_manager.py** - Whisper model lifecycle management
   - `get_whisper_model()`:44 - Returns cached model instances or loads new ones
   - `test_gpu_driver()`:20 - GPU validation before model loading
   - **CRITICAL**: GPU-only mode enforced at lines 64-90 (no CPU fallback)
   - Global `model_instances` dict caches loaded models to prevent reloading
   - Automatic batch size optimization based on GPU memory (lines 134-147)

5. **src/core/job_queue.py** - Async job queue manager
   - `JobQueue` class manages FIFO queue with background worker thread
   - `submit_job()` - Validates audio, checks GPU health, adds to queue
   - `get_job_status()` - Returns current job status and queue position
   - `get_job_result()` - Returns transcription result for completed jobs
   - Jobs persist to disk as JSON files for crash recovery
   - Single worker thread processes jobs sequentially (prevents GPU contention)

6. **src/core/gpu_health.py** - GPU health monitoring
   - `check_gpu_health()`:39 - Real GPU test using tiny model + test audio
   - `GPUHealthStatus` dataclass contains detailed GPU metrics
   - **CRITICAL**: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
   - Prevents silent CPU fallback that would cause 10-100x slowdown
   - `HealthMonitor` class for periodic background monitoring

7. **src/utils/audio_processor.py** - Audio file validation and preprocessing
   - `validate_audio_file()`:15 - Checks file existence, format, and size
   - `process_audio()`:50 - Decodes audio using faster_whisper's decode_audio

8. **src/utils/formatters.py** - Output format conversion
   - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - Convert segments to various formats
   - All formatters accept segment lists from Whisper output

9. **src/utils/test_audio_generator.py** - Test audio generation
   - `generate_test_audio()` - Creates synthetic 1-second audio for GPU health checks
   - Uses numpy to generate sine wave, no external audio files needed
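The utility in item 9 is small enough to illustrate inline. Below is a minimal sketch of the kind of clip `generate_test_audio()` produces; the 16 kHz sample rate, 440 Hz tone, and function name are assumptions for illustration, not the actual implementation:

```python
# Sketch of synthetic test-audio generation (illustrative; the real helper
# lives in src/utils/test_audio_generator.py and may use different parameters).
import numpy as np

def generate_sine_test_clip(duration_s: float = 1.0,
                            sample_rate: int = 16000,
                            frequency_hz: float = 440.0) -> np.ndarray:
    """Return a short mono sine wave as float32 samples in [-1, 1]."""
    t = np.linspace(0.0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return (0.5 * np.sin(2.0 * np.pi * frequency_hz * t)).astype(np.float32)
```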
### Key Architecture Patterns

- **Dual Server Architecture**: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: Models are cached in the `model_instances` dictionary with key format `{model_name}_{device}_{compute_type}` (src/core/model_manager.py:104). This cache is shared if both servers run in the same process
- **Batch Processing**: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
- **Environment Variable Configuration**: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
- **GPU-Only Mode**: Service is configured for GPU-only operation. `device="auto"` requires CUDA, `device="cpu"` is rejected (src/core/model_manager.py:64-90)
- **Async Job Queue**: Long-running transcriptions use an async queue pattern to prevent HTTP timeouts. Jobs return immediately with a job_id for polling
- **GPU Health Monitoring**: Real GPU tests with the tiny model prevent silent CPU fallback. Jobs are rejected immediately if the GPU fails rather than running 10-100x slower on CPU

## Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

**API Server Specific:**
- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

**Job Queue Configuration (if using async features):**
- `JOB_QUEUE_MAX_SIZE` - Maximum queue size (default: 100)
- `JOB_METADATA_DIR` - Directory for job metadata JSON files
- `JOB_RETENTION_DAYS` - Auto-cleanup old jobs (0=disabled)

**GPU Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED` - Enable periodic GPU monitoring (true/false)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - Monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - Model for health checks (default: tiny)

**GPU Auto-Reset Configuration:**
- `GPU_RESET_COOLDOWN_MINUTES` - Minimum time between GPU reset attempts (default: 5 minutes)
- Prevents reset loops while allowing recovery from sleep/wake cycles
- Auto-reset is **enabled by default**
- Service terminates if GPU unavailable after reset attempt

**Transcription Configuration (shared by both servers):**
- `CUDA_VISIBLE_DEVICES` - GPU device selection
- `WHISPER_MODEL_DIR` - Model storage location (defaults to None for HuggingFace cache)
- `TRANSCRIPTION_OUTPUT_DIR` - Default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - Default output directory for batch processing
- `TRANSCRIPTION_MODEL` - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- `TRANSCRIPTION_DEVICE` - Execution device (cuda, auto) - **NOTE: cpu is rejected in GPU-only mode**
- `TRANSCRIPTION_COMPUTE_TYPE` - Computation type (float16, int8, auto)
- `TRANSCRIPTION_OUTPUT_FORMAT` - Output format (vtt, srt, txt, json)
- `TRANSCRIPTION_BEAM_SIZE` - Beam search size (default: 5)
- `TRANSCRIPTION_TEMPERATURE` - Sampling temperature (default: 0.0)
- `TRANSCRIPTION_USE_TIMESTAMP` - Add timestamp to filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - Prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - Suffix for output filenames
- `TRANSCRIPTION_LANGUAGE` - Language code (zh, en, ja, etc., auto-detect if not set)
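The transcription variables above are read as parameter defaults in src/core/transcriber.py (lines 21-37). A sketch of that pattern follows; the default values shown are taken from this file's documentation and may not match the source exactly:

```python
# Sketch of the env-var default pattern used for transcription parameters
# (illustrative; see src/core/transcriber.py:21-37 for the real defaults).
import os

DEFAULT_MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEFAULT_DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")
DEFAULT_COMPUTE_TYPE = os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto")
DEFAULT_OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
DEFAULT_BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
DEFAULT_TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
DEFAULT_LANGUAGE = os.getenv("TRANSCRIPTION_LANGUAGE")  # None -> auto-detect
```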
## Supported Configurations

- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /
Returns API information and available endpoints.

### GET /health
Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

### POST /transcribe
Transcribe a single audio file that exists on the server.

**Request Body:**
```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**
```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```

### POST /batch-transcribe
Batch transcribe all audio files in a folder.

**Request Body:**
```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**
```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**
- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.

### API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
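For clients that prefer Python over curl, here is a minimal polling sketch against the async job endpoints shown above. It assumes the `requests` library and the response fields from the curl examples; the terminal status values ("completed", "failed") are assumptions, not confirmed API values:

```python
# Submit an async transcription job and poll until it finishes (sketch).
import time
import requests

BASE_URL = "http://localhost:8000"

job = requests.post(f"{BASE_URL}/jobs",
                    json={"audio_path": "/path/to/audio.mp3"}).json()
job_id = job["job_id"]
print("Submitted:", job_id, "status:", job["status"])

# Poll every 5-10 seconds, as recommended for long-running transcriptions.
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if status["status"] in ("completed", "failed"):  # assumed terminal states
        break
    time.sleep(5)

if status["status"] == "completed":
    result = requests.get(f"{BASE_URL}/jobs/{job_id}/result")
    print(result.text)
else:
    print("Job failed:", status)
```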
## GPU Auto-Reset Configuration

### Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is **enabled by default** and includes cooldown protection to prevent reset loops.

### Passwordless Sudo Setup (Required)

For automatic GPU reset to work, you must configure passwordless sudo for NVIDIA commands.

Create a sudoers configuration file:

```bash
sudo visudo -f /etc/sudoers.d/whisper-gpu-reset
```

Add the following (replace `your_username` with your actual username):

```
# Whisper GPU Auto-Reset Permissions
your_username ALL=(ALL) NOPASSWD: /bin/systemctl stop nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /bin/systemctl start nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_drm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_drm
```

**Security Note:** These permissions are limited to specific NVIDIA driver commands only. The reset script (`reset_gpu.sh`) is executed with sudo but is part of the codebase and can be audited.

### How It Works

1. **Startup Check**: When the service starts, it performs a GPU health check
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → service terminates

2. **Runtime Check**: Before job submission and model loading
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → job rejected, service continues

3. **Cooldown Protection**: Prevents reset loops (see the sketch at the end of this section)
   - Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
   - Cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
   - If reset needed but cooldown active → service/job fails immediately

### Manual GPU Reset

You can manually reset the GPU at any time:

```bash
./reset_gpu.sh
```

Or clear the cooldown to allow an immediate reset:

```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```

### Behavior Examples

**After sleep/wake with GPU issue:**
```
Service starts → GPU check fails (CUDA error) → Cooldown OK
→ Reset drivers → Wait 3s → Retry → Success → Service continues
```

**Multiple failures (hardware issue):**
```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```

**Normal operation:**
```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```
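The cooldown protection described under "How It Works" amounts to comparing the timestamp recorded in `/tmp/whisper-gpu-last-reset` against `GPU_RESET_COOLDOWN_MINUTES`. Below is a sketch of that check, assuming the file's modification time serves as the marker; the real logic in `core.gpu_reset` may store and read the timestamp differently:

```python
# Sketch of the reset-cooldown check (illustrative; see core.gpu_reset for
# the actual implementation).
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_MINUTES = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5"))

def reset_allowed() -> bool:
    """True if no reset has been attempted within the cooldown window."""
    try:
        last_reset = os.path.getmtime(COOLDOWN_FILE)
    except FileNotFoundError:
        return True  # no previous reset recorded
    return (time.time() - last_reset) >= COOLDOWN_MINUTES * 60

def record_reset_attempt() -> None:
    """Persist the attempt so the cooldown survives service restarts."""
    with open(COOLDOWN_FILE, "w") as fh:
        fh.write(str(int(time.time())))
```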
## Important Implementation Details

### GPU-Only Architecture

- **CRITICAL**: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
- `device="auto"` requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
- GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
- If the GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- **GPU Auto-Reset**: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)

### Model Management

- GPU memory is checked before loading models (src/core/model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise)
- Models are cached globally in `model_instances` dict, shared across requests
- Model loading includes a GPU driver test to fail fast if the GPU is unavailable (src/core/model_manager.py:112-114)

### Transcription Settings

- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
- Word timestamps are enabled by default (src/core/transcriber.py:107)
- Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
- Default output format is "txt" for the REST API, configured via environment variables for the MCP server

### Async Job Queue (if enabled)

- Single worker thread processes jobs sequentially (prevents GPU memory contention)
- Jobs persist to disk as JSON files in JOB_METADATA_DIR
- Queue has a max size limit (default 100), returns 503 when full
- Job status polling recommended every 5-10 seconds for LLM agents

## Development Workflow

### Testing GPU Health

```python
# Test GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```

### Testing Job Queue

```python
# Test job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get result when completed
result = queue.get_job_result(job_info['job_id'])
```

### Common Debugging

**Model loading issues:**
- Check `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check logs for GPU driver test failures at model_manager.py:112-114

**GPU not detected** (see the diagnostic snippet at the end of this section):
- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check `torch.cuda.is_available()` returns True
- Run the GPU health check to see the detailed error

**Silent failures:**
- Check that the service is NOT silently falling back to CPU
- GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU

**Job queue issues:**
- Check `JOB_METADATA_DIR` exists and is writable
- Verify background worker thread is running (check logs)
- Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json
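A quick diagnostic for the "Model loading issues" and "GPU not detected" checks above. The batch-size tiers mirror the values listed under Model Management; this is a sketch using standard torch calls, not the service's own code:

```python
# Quick GPU diagnostic (run inside the service's virtualenv).
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available():", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gb:.1f} GB total")

    # Batch-size tiers from the Model Management notes above.
    for threshold_gb, batch_size in [(16, 32), (12, 16), (8, 8), (4, 4)]:
        if total_gb > threshold_gb:
            print("Expected batch size:", batch_size)
            break
    else:
        print("Expected batch size:", 2)
```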
### File Locations

- **Logs**: `mcp.logs` (MCP server), `api.logs` (API server)
- **Models**: `$WHISPER_MODEL_DIR` or HuggingFace cache
- **Outputs**: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- **Job Metadata**: `$JOB_METADATA_DIR/{job_id}.json`

### Important Development Notes

- See `DEV_PLAN.md` for the detailed architecture and implementation plan for the async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time >30 seconds (see the sketch below)
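The async pattern referenced in the last note follows the job-queue flow documented above: validate the request, enqueue a job, and return a `job_id` immediately instead of blocking until transcription finishes. Below is a sketch of such an endpoint that reuses the `JobQueue` calls shown under "Testing Job Queue"; the request model, route, and error handling are illustrative and not the actual api_server.py code:

```python
# Sketch of an async-pattern endpoint for long-running transcriptions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from src.core.job_queue import JobQueue

app = FastAPI()
queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

class TranscribeJobRequest(BaseModel):
    audio_path: str
    model_name: str = "large-v3"
    device: str = "auto"

@app.post("/jobs")
def submit_transcription_job(request: TranscribeJobRequest):
    """Enqueue the job and return immediately; clients poll /jobs/{job_id}."""
    try:
        job_info = queue.submit_job(
            audio_path=request.audio_path,
            model_name=request.model_name,
            device=request.device,
        )
    except Exception as exc:  # e.g. queue full or GPU health failure
        raise HTTPException(status_code=503, detail=str(exc))
    return {"job_id": job_info["job_id"], "status": "queued"}
```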