CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:

  1. MCP Server - For integration with Claude Desktop and other MCP clients
  2. REST API Server - For HTTP-based integrations with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

Key Features:

  • Async job queue system for long-running transcriptions (prevents HTTP timeouts)
  • GPU health monitoring with strict failure detection (prevents silent CPU fallback)
  • Automatic GPU driver reset on CUDA errors with cooldown protection (handles sleep/wake issues)
  • Dual-server architecture (MCP + REST API)
  • Model caching for fast repeated transcriptions
  • Automatic batch size optimization based on GPU memory

Development Commands

Environment Setup

# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu

Running the Servers

MCP Server (for Claude Desktop)

# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution
python whisper_server.py

# Using MCP CLI for development testing
mcp dev whisper_server.py

# Run server with MCP CLI
mcp run whisper_server.py

REST API Server (for HTTP clients)

# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn
python api_server.py

# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000

Running Both Simultaneously

# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh

Docker

# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server

Architecture

Directory Structure

.
├── src/                          # Source code directory
│   ├── servers/                  # Server implementations
│   │   ├── whisper_server.py    # MCP server entry point
│   │   └── api_server.py        # REST API server (async job queue)
│   ├── core/                     # Core business logic
│   │   ├── transcriber.py       # Transcription logic (single & batch)
│   │   ├── model_manager.py     # Model lifecycle & caching
│   │   ├── job_queue.py         # Async job queue manager
│   │   ├── gpu_health.py        # GPU health monitoring
│   │   └── gpu_reset.py         # GPU driver reset with cooldown protection
│   └── utils/                    # Utility modules
│       ├── audio_processor.py   # Audio validation & preprocessing
│       ├── formatters.py        # Output format conversion
│       └── test_audio_generator.py # Test audio generation for GPU checks
├── run_mcp_server.sh            # MCP server startup script
├── run_api_server.sh            # API server startup script
├── reset_gpu.sh                 # GPU driver reset script
├── DEV_PLAN.md                  # Development plan for async features
├── requirements.txt              # Python dependencies
└── pyproject.toml               # Project configuration

Core Components

  1. src/servers/whisper_server.py - MCP server entry point

    • Uses FastMCP framework to expose MCP tools
    • Three main tools: get_model_info_api(), transcribe(), batch_transcribe_audio()
    • Server initialization at line 19
  2. src/servers/api_server.py - REST API server entry point

    • Uses FastAPI framework for HTTP endpoints
    • Provides REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
    • Shares core transcription logic with MCP server
    • File upload support via multipart/form-data
  3. src/core/transcriber.py - Core transcription logic (shared by both servers)

    • transcribe_audio():39 - Single file transcription with environment variable support
    • batch_transcribe():209 - Batch processing with progress reporting
    • All parameters support environment variable defaults (lines 21-37)
    • Delegates output formatting to utils.formatters
  4. src/core/model_manager.py - Whisper model lifecycle management

    • get_whisper_model():44 - Returns cached model instances or loads new ones
    • test_gpu_driver():20 - GPU validation before model loading
    • CRITICAL: GPU-only mode enforced at lines 64-90 (no CPU fallback)
    • Global model_instances dict caches loaded models to prevent reloading
    • Automatic batch size optimization based on GPU memory (lines 134-147)
  5. src/core/job_queue.py - Async job queue manager

    • JobQueue class manages FIFO queue with background worker thread
    • submit_job() - Validates audio, checks GPU health, adds to queue
    • get_job_status() - Returns current job status and queue position
    • get_job_result() - Returns transcription result for completed jobs
    • Jobs persist to disk as JSON files for crash recovery
    • Single worker thread processes jobs sequentially (prevents GPU contention)
  6. src/core/gpu_health.py - GPU health monitoring

    • check_gpu_health():39 - Real GPU test using tiny model + test audio
    • GPUHealthStatus dataclass contains detailed GPU metrics
    • CRITICAL: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
    • Prevents silent CPU fallback that would cause 10-100x slowdown
    • HealthMonitor class for periodic background monitoring
  7. src/utils/audio_processor.py - Audio file validation and preprocessing

    • validate_audio_file():15 - Checks file existence, format, and size
    • process_audio():50 - Decodes audio using faster_whisper's decode_audio
  8. src/utils/formatters.py - Output format conversion

    • format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats
    • All formatters accept segment lists from Whisper output
  9. src/utils/test_audio_generator.py - Test audio generation

    • generate_test_audio() - Creates synthetic 1-second audio for GPU health checks
    • Uses numpy to generate sine wave, no external audio files needed
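
A minimal sketch of the test-audio idea from item 9, assuming a 16 kHz sample rate and a 440 Hz tone (the real generate_test_audio() in src/utils/test_audio_generator.py may differ):

# Illustrative only; not the actual implementation
import numpy as np

def generate_test_audio(duration_s: float = 1.0, sample_rate: int = 16000) -> np.ndarray:
    """Synthesize a short sine wave as float32 mono audio for GPU health checks."""
    t = np.linspace(0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)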

Key Architecture Patterns

  • Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
  • Model Caching: Models are cached in model_instances dictionary with key format {model_name}_{device}_{compute_type} (src/core/model_manager.py:104). This cache is shared if both servers run in the same process
  • Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
  • Environment Variable Configuration: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
  • GPU-Only Mode: Service is configured for GPU-only operation. device="auto" requires CUDA, device="cpu" is rejected (src/core/model_manager.py:64-90)
  • Async Job Queue: Long-running transcriptions use async queue pattern to prevent HTTP timeouts. Jobs return immediately with job_id for polling
  • GPU Health Monitoring: Real GPU tests with tiny model prevent silent CPU fallback. Jobs are rejected immediately if GPU fails rather than running 10-100x slower on CPU
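
A minimal sketch of the model-caching pattern described above, using the documented cache key format (the real get_whisper_model() in src/core/model_manager.py also performs GPU checks and batched-pipeline setup):

# Illustrative only; the actual cache lives in src/core/model_manager.py
from faster_whisper import WhisperModel

model_instances = {}  # global cache, shared across requests in the same process

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]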

Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

API Server Specific:

  • API_HOST - API server host (default: 0.0.0.0)
  • API_PORT - API server port (default: 8000)

Job Queue Configuration (if using async features):

  • JOB_QUEUE_MAX_SIZE - Maximum queue size (default: 100)
  • JOB_METADATA_DIR - Directory for job metadata JSON files
  • JOB_RETENTION_DAYS - Auto-cleanup old jobs (0=disabled)

GPU Health Monitoring:

  • GPU_HEALTH_CHECK_ENABLED - Enable periodic GPU monitoring (true/false)
  • GPU_HEALTH_CHECK_INTERVAL_MINUTES - Monitoring interval (default: 10)
  • GPU_HEALTH_TEST_MODEL - Model for health checks (default: tiny)

GPU Auto-Reset Configuration:

  • GPU_RESET_COOLDOWN_MINUTES - Minimum time between GPU reset attempts (default: 5 minutes)
    • Prevents reset loops while allowing recovery from sleep/wake cycles
    • Auto-reset is enabled by default
    • Service terminates if GPU unavailable after reset attempt

Transcription Configuration (shared by both servers):

  • CUDA_VISIBLE_DEVICES - GPU device selection
  • WHISPER_MODEL_DIR - Model storage location (defaults to None for HuggingFace cache)
  • TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
  • TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
  • TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
  • TRANSCRIPTION_DEVICE - Execution device (cuda, auto) - NOTE: cpu is rejected in GPU-only mode
  • TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
  • TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
  • TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
  • TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
  • TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
  • TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
  • TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
  • TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc., auto-detect if not set)
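
A sketch of how these defaults might be read (the actual code is at src/core/transcriber.py:21-37; the values shown follow the documented defaults):

# Illustrative env-var default pattern
import os

DEFAULT_MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEFAULT_DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")
DEFAULT_COMPUTE_TYPE = os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto")
DEFAULT_OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
DEFAULT_BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
DEFAULT_TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
DEFAULT_LANGUAGE = os.getenv("TRANSCRIPTION_LANGUAGE")  # None -> auto-detect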

Supported Configurations

  • Models: tiny, base, small, medium, large-v1, large-v2, large-v3
  • Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
  • Output formats: vtt, srt, json, txt
  • Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

REST API Endpoints

The REST API server provides the following HTTP endpoints:

GET /

Returns API information and available endpoints.

GET /health

Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.

GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

POST /transcribe

Transcribe a single audio file that exists on the server.

Request Body:

{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}

Response:

{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}

POST /batch-transcribe

Batch transcribe all audio files in a folder.

Request Body:

{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}

Response:

{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}

POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

Form Data:

  • file: Audio file (multipart/form-data)
  • model_name: Model name (default: "large-v3")
  • device: Device (default: "auto")
  • output_format: Output format (default: "txt")
  • ... (other transcription parameters)

Response: Returns the transcription file for download.

API Usage Examples

# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}

GPU Auto-Reset Configuration

Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is enabled by default and includes cooldown protection to prevent reset loops.

Passwordless Sudo Setup (Required)

For automatic GPU reset to work, you must configure passwordless sudo for NVIDIA commands. Create a sudoers configuration file:

sudo visudo -f /etc/sudoers.d/whisper-gpu-reset

Add the following (replace your_username with your actual username):

# Whisper GPU Auto-Reset Permissions
your_username ALL=(ALL) NOPASSWD: /bin/systemctl stop nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /bin/systemctl start nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_drm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_drm

Security Note: These permissions are limited to specific NVIDIA driver commands only. The reset script (reset_gpu.sh) invokes these commands with sudo; it is part of the codebase and can be audited.

How It Works

  1. Startup Check: When the service starts, it performs a GPU health check

    • If CUDA errors detected → automatic reset attempt → retry
    • If retry fails → service terminates
  2. Runtime Check: Before job submission and model loading

    • If CUDA errors detected → automatic reset attempt → retry
    • If retry fails → job rejected, service continues
  3. Cooldown Protection: Prevents reset loops

    • Minimum 5 minutes between reset attempts (configurable via GPU_RESET_COOLDOWN_MINUTES)
    • Cooldown persists across restarts (stored in /tmp/whisper-gpu-last-reset)
    • If reset needed but cooldown active → service/job fails immediately
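
A sketch of the cooldown check described above, assuming the file simply stores a Unix timestamp (the real logic lives in src/core/gpu_reset.py):

# Illustrative cooldown check; the actual implementation is in src/core/gpu_reset.py
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_SECONDS = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5")) * 60

def reset_allowed() -> bool:
    """Return True if enough time has passed since the last reset attempt."""
    try:
        with open(COOLDOWN_FILE) as f:
            last_reset = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True  # no prior reset recorded
    return (time.time() - last_reset) >= COOLDOWN_SECONDS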

Manual GPU Reset

You can manually reset the GPU anytime:

./reset_gpu.sh

Or clear the cooldown to allow immediate reset:

from src.core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()

Behavior Examples

After sleep/wake with GPU issue:

Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues

Multiple failures (hardware issue):

First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)

Normal operation:

No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures

Important Implementation Details

GPU-Only Architecture

  • CRITICAL: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
  • device="auto" requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
  • GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
  • If GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
  • GPU Auto-Reset: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)
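
A simplified sketch of the device enforcement described above (the real checks at src/core/model_manager.py:64-90 also run a real-model health test, not just torch.cuda.is_available(); error messages here are illustrative):

# Simplified illustration of GPU-only enforcement
import torch

def resolve_device(device: str) -> str:
    if device == "cpu":
        raise RuntimeError("CPU device is rejected: this service runs in GPU-only mode.")
    if device in ("auto", "cuda") and not torch.cuda.is_available():
        raise RuntimeError("CUDA is required but not available.")
    return "cuda"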

Model Management

  • GPU memory is checked before loading models (src/core/model_manager.py:115-127)
  • Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise)
  • Models are cached globally in model_instances dict, shared across requests
  • Model loading includes GPU driver test to fail fast if GPU is unavailable (src/core/model_manager.py:112-114)
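
A sketch of the batch-size thresholds listed above (the actual selection lives at src/core/model_manager.py:134-147):

# Illustrative batch-size selection based on total GPU memory (requires CUDA)
import torch

def pick_batch_size(device_index: int = 0) -> int:
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / (1024 ** 3)
    if total_gb > 16:
        return 32
    if total_gb > 12:
        return 16
    if total_gb > 8:
        return 8
    if total_gb > 4:
        return 4
    return 2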

Transcription Settings

  • VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
  • Word timestamps are enabled by default (src/core/transcriber.py:107)
  • Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
  • The default output format is "txt" for the REST API; for the MCP server it is configured via TRANSCRIPTION_OUTPUT_FORMAT

Async Job Queue (if enabled)

  • Single worker thread processes jobs sequentially (prevents GPU memory contention)
  • Jobs persist to disk as JSON files in JOB_METADATA_DIR
  • Queue has max size limit (default 100), returns 503 when full
  • Job status polling every 5-10 seconds is recommended for LLM agents

Development Workflow

Testing GPU Health

# Test GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU

Testing Job Queue

# Test job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get result when completed
result = queue.get_job_result(job_info['job_id'])

Common Debugging

Model loading issues:

  • Check WHISPER_MODEL_DIR is set correctly
  • Verify GPU memory with nvidia-smi
  • Check logs for GPU driver test failures at model_manager.py:112-114

GPU not detected:

  • Verify CUDA_VISIBLE_DEVICES is set correctly
  • Check torch.cuda.is_available() returns True
  • Run GPU health check to see detailed error

Silent failures:

  • Check that service is NOT silently falling back to CPU
  • GPU health check should RAISE errors, not log warnings
  • If device=cuda fails, the job should be rejected, not processed on CPU

Job queue issues:

  • Check JOB_METADATA_DIR exists and is writable
  • Verify background worker thread is running (check logs)
  • Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json
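
To inspect a stuck job directly, read its metadata file (a sketch; the exact JSON fields are whatever job_queue.py persists):

# Inspect a job's persisted metadata (field names depend on job_queue.py)
import json
import os

job_id = "abc-123"  # example job ID
metadata_path = os.path.join(os.environ["JOB_METADATA_DIR"], f"{job_id}.json")
with open(metadata_path) as f:
    print(json.dumps(json.load(f), indent=2))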

File Locations

  • Logs: mcp.logs (MCP server), api.logs (API server)
  • Models: $WHISPER_MODEL_DIR or HuggingFace cache
  • Outputs: $TRANSCRIPTION_OUTPUT_DIR or $TRANSCRIPTION_BATCH_OUTPUT_DIR
  • Job Metadata: $JOB_METADATA_DIR/{job_id}.json

Important Development Notes

  • See DEV_PLAN.md for detailed architecture and implementation plan for async job queue features
  • The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
  • When modifying model_manager.py, maintain the strict GPU-only enforcement
  • When adding new endpoints, follow the async pattern if transcription time >30 seconds