CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:

  1. MCP Server - For integration with Claude Desktop and other MCP clients
  2. REST API Server - For HTTP-based integrations with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

Key Features:

  • Async job queue system for long-running transcriptions (prevents HTTP timeouts)
  • GPU health monitoring with strict failure detection (prevents silent CPU fallback)
  • Automatic GPU driver reset on CUDA errors with cooldown protection (handles sleep/wake issues)
  • Dual-server architecture (MCP + REST API)
  • Model caching for fast repeated transcriptions
  • Automatic batch size optimization based on GPU memory

Development Commands

Environment Setup

# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
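
After installing, verify that PyTorch actually sees the GPU before starting the servers. A quick check from the venv's Python shell:

# Verify CUDA is visible to PyTorch (run inside the venv)
import torch
print(torch.cuda.is_available())    # expect True on a working CUDA setup
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU")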

Running the Servers

MCP Server (for Claude Desktop)

# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
python src/servers/whisper_server.py

# Using MCP CLI for development testing
mcp dev src/servers/whisper_server.py

REST API Server (for HTTP clients)

# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
uvicorn src.servers.api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn src.servers.api_server:app --reload --host 0.0.0.0 --port 8000

Running Both Simultaneously

# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh

Running Tests

# Run all tests (requires GPU)
python tests/test_core_components.py
python tests/test_e2e_integration.py
python tests/test_async_api_integration.py

# Or run individual test components
cd tests && python test_core_components.py

Important: All tests require a GPU. They will fail if CUDA is not properly configured.

Docker

# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server

Architecture

Directory Structure

.
├── src/                          # Source code directory
│   ├── servers/                  # Server implementations
│   │   ├── whisper_server.py    # MCP server entry point
│   │   └── api_server.py        # REST API server (async job queue)
│   ├── core/                     # Core business logic
│   │   ├── transcriber.py       # Transcription logic (single & batch)
│   │   ├── model_manager.py     # Model lifecycle & caching
│   │   ├── job_queue.py         # Async job queue manager
│   │   ├── gpu_health.py        # GPU health monitoring
│   │   └── gpu_reset.py         # GPU driver reset with cooldown
│   └── utils/                    # Utility modules
│       ├── audio_processor.py   # Audio validation & preprocessing
│       ├── formatters.py        # Output format conversion
│       ├── test_audio_generator.py # Test audio generation for GPU checks
│       ├── startup.py           # Startup sequence orchestration
│       ├── circuit_breaker.py   # Circuit breaker pattern implementation
│       └── input_validation.py  # Input validation utilities
├── tests/                        # Test suite (requires GPU)
│   ├── test_core_components.py  # Core functionality tests
│   ├── test_e2e_integration.py  # End-to-end integration tests
│   └── test_async_api_integration.py # Async API tests
├── run_mcp_server.sh            # MCP server startup script
├── run_api_server.sh            # API server startup script
├── reset_gpu.sh                 # GPU driver reset script
├── DEV_PLAN.md                  # Development plan for async features
├── requirements.txt              # Python dependencies
└── pyproject.toml               # Project configuration

Core Components

  1. src/servers/whisper_server.py - MCP server entry point

    • Uses FastMCP framework to expose MCP tools
    • Main tools: get_model_info_api(), transcribe_async(), transcribe_upload(), check_job_status(), get_job_result()
    • Global job queue and health monitor instances
    • Server initialization around line 31
  2. src/servers/api_server.py - REST API server entry point

    • Uses FastAPI framework for HTTP endpoints
    • Provides REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
    • Shares core transcription logic with MCP server
    • File upload support via multipart/form-data
  3. src/core/transcriber.py - Core transcription logic (shared by both servers; see the sketch after this list)

    • transcribe_audio():39 - Single file transcription with environment variable support
    • batch_transcribe():209 - Batch processing with progress reporting
    • All parameters support environment variable defaults (lines 21-37)
    • Delegates output formatting to utils.formatters
  4. src/core/model_manager.py - Whisper model lifecycle management

    • get_whisper_model():44 - Returns cached model instances or loads new ones
    • test_gpu_driver():20 - GPU validation before model loading
    • CRITICAL: GPU-only mode enforced at lines 64-90 (no CPU fallback)
    • Global model_instances dict caches loaded models to prevent reloading
    • Automatic batch size optimization based on GPU memory (lines 134-147)
  5. src/core/job_queue.py - Async job queue manager

    • JobQueue class manages FIFO queue with background worker thread
    • submit_job() - Validates audio, checks GPU health, adds to queue
    • get_job_status() - Returns current job status and queue position
    • get_job_result() - Returns transcription result for completed jobs
    • Jobs persist to disk as JSON files for crash recovery
    • Single worker thread processes jobs sequentially (prevents GPU contention)
  6. src/core/gpu_health.py - GPU health monitoring

    • check_gpu_health():39 - Real GPU test using tiny model + test audio
    • GPUHealthStatus dataclass contains detailed GPU metrics
    • CRITICAL: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
    • Prevents silent CPU fallback that would cause 10-100x slowdown
    • HealthMonitor class for periodic background monitoring
  7. src/utils/audio_processor.py - Audio file validation and preprocessing

    • validate_audio_file():15 - Checks file existence, format, and size
    • process_audio():50 - Decodes audio using faster_whisper's decode_audio
  8. src/utils/formatters.py - Output format conversion

    • format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats
    • All formatters accept segment lists from Whisper output
  9. src/utils/test_audio_generator.py - Test audio generation

    • generate_test_audio() - Creates synthetic 1-second audio for GPU health checks
    • Uses numpy to generate sine wave, no external audio files needed
  10. src/core/gpu_reset.py - GPU driver reset with cooldown protection

    • reset_gpu_driver() - Executes reset_gpu.sh script to reload NVIDIA drivers
    • check_reset_cooldown() - Validates if enough time has passed since last reset
    • Cooldown timestamp persists in /tmp/whisper-gpu-last-reset
    • Prevents reset loops while allowing recovery from sleep/wake issues
  11. src/utils/startup.py - Startup sequence orchestration

    • startup_sequence() - Coordinates GPU health check, queue initialization
    • cleanup_on_shutdown() - Cleanup handler for graceful shutdown
    • Centralizes startup logic shared by both servers
  12. src/utils/circuit_breaker.py - Circuit breaker pattern implementation

    • Provides fault tolerance for external service calls
    • Prevents cascading failures
  13. src/utils/input_validation.py - Input validation utilities

    • Validates and sanitizes user inputs
    • Security layer for API endpoints
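
To illustrate the shared-core pattern from items 1-3 above, here is a minimal sketch of calling the core transcription function directly. It assumes transcribe_audio() accepts the same parameters the REST /transcribe endpoint exposes; see src/core/transcriber.py for the actual signature.

# Minimal sketch: both servers delegate to the same core function.
# Run with src/ on PYTHONPATH; parameter names assumed from the REST API schema.
from core.transcriber import transcribe_audio

message = transcribe_audio(
    audio_path="/path/to/audio.mp3",
    model_name="large-v3",
    device="auto",           # GPU-only: "auto" requires CUDA, "cpu" is rejected
    output_format="txt",
)
print(message)  # e.g. "Transcription successful, results saved to: ..."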

Key Architecture Patterns

  • Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
  • Model Caching: Models are cached in the model_instances dictionary with key format {model_name}_{device}_{compute_type} (src/core/model_manager.py:104). This cache is shared if both servers run in the same process (see the sketch after this list)
  • Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
  • Environment Variable Configuration: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
  • GPU-Only Mode: Service is configured for GPU-only operation. device="auto" requires CUDA, device="cpu" is rejected (src/core/model_manager.py:64-90)
  • Async Job Queue: Long-running transcriptions use async queue pattern to prevent HTTP timeouts. Jobs return immediately with job_id for polling
  • GPU Health Monitoring: Real GPU tests with tiny model prevent silent CPU fallback. Jobs are rejected immediately if GPU fails rather than running 10-100x slower on CPU
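
A minimal sketch of the model-caching pattern, using the documented key format; the real get_whisper_model() also runs a GPU driver test and batch-size selection before caching.

# Sketch of the cache-key pattern (names assumed from model_manager.py).
model_instances = {}

def get_cached_model(model_name, device, compute_type, loader):
    key = f"{model_name}_{device}_{compute_type}"   # documented key format
    if key not in model_instances:
        model_instances[key] = loader(model_name, device, compute_type)
    return model_instances[key]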

Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

API Server Specific:

  • API_HOST - API server host (default: 0.0.0.0)
  • API_PORT - API server port (default: 8000)

Job Queue Configuration (if using async features):

  • JOB_QUEUE_MAX_SIZE - Maximum queue size (default: 100)
  • JOB_METADATA_DIR - Directory for job metadata JSON files
  • JOB_RETENTION_DAYS - Auto-cleanup old jobs (0=disabled)

GPU Health Monitoring:

  • GPU_HEALTH_CHECK_ENABLED - Enable periodic GPU monitoring (true/false)
  • GPU_HEALTH_CHECK_INTERVAL_MINUTES - Monitoring interval (default: 10)
  • GPU_HEALTH_TEST_MODEL - Model for health checks (default: tiny)

GPU Auto-Reset Configuration:

  • GPU_RESET_COOLDOWN_MINUTES - Minimum time between GPU reset attempts (default: 5 minutes)
    • Prevents reset loops while allowing recovery from sleep/wake cycles
    • Auto-reset is enabled by default
    • Service terminates if GPU unavailable after reset attempt

Transcription Configuration (shared by both servers):

  • CUDA_VISIBLE_DEVICES - GPU device selection
  • WHISPER_MODEL_DIR - Model storage location (defaults to None for HuggingFace cache)
  • TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
  • TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
  • TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
  • TRANSCRIPTION_DEVICE - Execution device (cuda, auto) - NOTE: cpu is rejected in GPU-only mode
  • TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
  • TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
  • TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
  • TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
  • TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
  • TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
  • TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
  • TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc., auto-detect if not set)
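
A sketch of how these defaults are typically resolved (the actual resolution lives at src/core/transcriber.py:21-37; the constant names here are illustrative):

# Illustrative env-var default resolution (constant names hypothetical).
import os

DEFAULT_MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEFAULT_DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")
DEFAULT_OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
DEFAULT_BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
DEFAULT_TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))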

Supported Configurations

  • Models: tiny, base, small, medium, large-v1, large-v2, large-v3
  • Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
  • Output formats: vtt, srt, json, txt
  • Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

REST API Endpoints

The REST API server provides the following HTTP endpoints:

GET /

Returns API information and available endpoints.

GET /health

Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.

GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

POST /transcribe

Transcribe a single audio file that exists on the server.

Request Body:

{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}

Response:

{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}

POST /batch-transcribe

Batch transcribe all audio files in a folder.

Request Body:

{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}

Response:

{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}

POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

Form Data:

  • file: Audio file (multipart/form-data)
  • model_name: Model name (default: "large-v3")
  • device: Device (default: "auto")
  • output_format: Output format (default: "txt")
  • ... (other transcription parameters)

Response: Returns the transcription file for download.
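
The same upload from Python, as a sketch using the requests library (field names as documented above):

# Sketch: upload a file and save the returned transcription.
import requests

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/upload-transcribe",
        files={"file": f},
        data={"model_name": "large-v3", "output_format": "txt"},
    )
resp.raise_for_status()
with open("transcript.txt", "wb") as out:
    out.write(resp.content)   # endpoint returns the transcription file itself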

API Usage Examples

# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}

GPU Auto-Reset Configuration

Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is enabled by default and includes cooldown protection to prevent reset loops.

How It Works

  1. Startup Check: When the service starts, it performs a GPU health check

    • If CUDA errors detected → automatic reset attempt → retry
    • If retry fails → service terminates
  2. Runtime Check: Before job submission and model loading

    • If CUDA errors detected → automatic reset attempt → retry
    • If retry fails → job rejected, service continues
  3. Cooldown Protection: Prevents reset loops

    • Minimum 5 minutes between reset attempts (configurable via GPU_RESET_COOLDOWN_MINUTES)
    • Cooldown persists across restarts (stored in /tmp/whisper-gpu-last-reset)
    • If reset needed but cooldown active → service/job fails immediately
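
A minimal sketch of the cooldown check in item 3, assuming the timestamp file simply stores a Unix time; the real implementation is check_reset_cooldown() in src/core/gpu_reset.py and may encode it differently:

# Sketch: cooldown gate for GPU resets (file format assumed).
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_SECONDS = 5 * 60   # GPU_RESET_COOLDOWN_MINUTES default

def cooldown_active() -> bool:
    if not os.path.exists(COOLDOWN_FILE):
        return False
    with open(COOLDOWN_FILE) as f:
        last_reset = float(f.read().strip())
    return (time.time() - last_reset) < COOLDOWN_SECONDS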

Manual GPU Reset

You can manually reset the GPU anytime:

./reset_gpu.sh

Or clear the cooldown to allow immediate reset:

# Run in a Python shell with src/ on PYTHONPATH
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()

Behavior Examples

After sleep/wake with GPU issue:

Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues

Multiple failures (hardware issue):

First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)

Normal operation:

No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures

Important Implementation Details

GPU-Only Architecture

  • CRITICAL: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
  • device="auto" requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
  • GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
  • If GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
  • GPU Auto-Reset: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)

Model Management

  • GPU memory is checked before loading models (src/core/model_manager.py:115-127)
  • Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise) - see the sketch after this list
  • Models are cached globally in model_instances dict, shared across requests
  • Model loading includes GPU driver test to fail fast if GPU is unavailable (src/core/model_manager.py:112-114)
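
The batch-size rule above, as a sketch (the real selection is at src/core/model_manager.py:134-147):

# Sketch of dynamic batch-size selection by GPU memory.
def pick_batch_size(gpu_mem_gb: float) -> int:
    if gpu_mem_gb > 16:
        return 32
    if gpu_mem_gb > 12:
        return 16
    if gpu_mem_gb > 8:
        return 8
    if gpu_mem_gb > 4:
        return 4
    return 2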

Transcription Settings

  • VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
  • Word timestamps are enabled by default (src/core/transcriber.py:107)
  • Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
  • Default output format is "txt" for REST API, configured via environment variables for MCP server

Async Job Queue (if enabled)

  • Single worker thread processes jobs sequentially (prevents GPU memory contention)
  • Jobs persist to disk as JSON files in JOB_METADATA_DIR
  • Queue has max size limit (default 100), returns 503 when full
  • Job status polling recommended every 5-10 seconds for LLM agents

Development Workflow

Running Tests

The test suite requires GPU access. Ensure CUDA is properly configured before running tests.

# Set PYTHONPATH to include src directory
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"

# Run core component tests (GPU health, job queue, audio validation)
python tests/test_core_components.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py

# Run async API integration tests
python tests/test_async_api_integration.py

Tests will automatically:

  • Check for GPU availability (exit if not available)
  • Validate audio file processing
  • Test GPU health monitoring
  • Test job queue operations
  • Test transcription pipeline

Testing GPU Health

# Test GPU health check manually (run with src/ on PYTHONPATH)
from core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU

Testing Job Queue

# Test job queue manually (run with src/ on PYTHONPATH)
from core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get result when completed
result = queue.get_job_result(job_info['job_id'])

Common Debugging

Model loading issues:

  • Check WHISPER_MODEL_DIR is set correctly
  • Verify GPU memory with nvidia-smi
  • Check logs for GPU driver test failures at model_manager.py:112-114

GPU not detected:

  • Verify CUDA_VISIBLE_DEVICES is set correctly
  • Check torch.cuda.is_available() returns True
  • Run GPU health check to see detailed error

Silent failures:

  • Check that service is NOT silently falling back to CPU
  • GPU health check should RAISE errors, not log warnings
  • If device=cuda fails, the job should be rejected, not processed on CPU

Job queue issues:

  • Check JOB_METADATA_DIR exists and is writable
  • Verify background worker thread is running (check logs)
  • Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json

File Locations

  • Logs: mcp.logs (MCP server), api.logs (API server)
  • Models: $WHISPER_MODEL_DIR or HuggingFace cache
  • Outputs: $TRANSCRIPTION_OUTPUT_DIR or $TRANSCRIPTION_BATCH_OUTPUT_DIR
  • Job Metadata: $JOB_METADATA_DIR/{job_id}.json

Important Development Notes

  • See DEV_PLAN.md for detailed architecture and implementation plan for async job queue features
  • The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
  • When modifying model_manager.py, maintain the strict GPU-only enforcement
  • When adding new endpoints, follow the async pattern if transcription time >30 seconds