# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:
- MCP Server - For integration with Claude Desktop and other MCP clients
- REST API Server - For HTTP-based integrations with async job queue support
Both servers share the same core transcription logic and can run independently or simultaneously on different ports.
Key Features:
- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- Automatic GPU driver reset on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory
## Development Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```
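To confirm that the CUDA-enabled build is actually installed before starting the servers, a quick check like the one below is enough (the service performs its own, stricter GPU health check at startup):

```python
# Quick sanity check of the PyTorch install; the service's real GPU health
# check loads a tiny Whisper model, this only verifies CUDA visibility.
import torch

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA runtime: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```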
### Running the Servers

#### MCP Server (for Claude Desktop)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
python src/servers/whisper_server.py

# Using MCP CLI for development testing
mcp dev src/servers/whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
uvicorn src.servers.api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn src.servers.api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Running Tests

```bash
# Run all tests (requires GPU)
python tests/test_core_components.py
python tests/test_e2e_integration.py
python tests/test_async_api_integration.py

# Or run individual test components
cd tests && python test_core_components.py
```

**Important**: All tests require a GPU to be available. Tests will fail if CUDA is not properly configured.

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```
## Architecture

### Directory Structure

```
.
├── src/                             # Source code directory
│   ├── servers/                     # Server implementations
│   │   ├── whisper_server.py        # MCP server entry point
│   │   └── api_server.py            # REST API server (async job queue)
│   ├── core/                        # Core business logic
│   │   ├── transcriber.py           # Transcription logic (single & batch)
│   │   ├── model_manager.py         # Model lifecycle & caching
│   │   ├── job_queue.py             # Async job queue manager
│   │   ├── gpu_health.py            # GPU health monitoring
│   │   └── gpu_reset.py             # GPU driver reset with cooldown
│   └── utils/                       # Utility modules
│       ├── audio_processor.py       # Audio validation & preprocessing
│       ├── formatters.py            # Output format conversion
│       ├── test_audio_generator.py  # Test audio generation for GPU checks
│       ├── startup.py               # Startup sequence orchestration
│       ├── circuit_breaker.py       # Circuit breaker pattern implementation
│       └── input_validation.py      # Input validation utilities
├── tests/                           # Test suite (requires GPU)
│   ├── test_core_components.py      # Core functionality tests
│   ├── test_e2e_integration.py      # End-to-end integration tests
│   └── test_async_api_integration.py  # Async API tests
├── run_mcp_server.sh                # MCP server startup script
├── run_api_server.sh                # API server startup script
├── reset_gpu.sh                     # GPU driver reset script
├── DEV_PLAN.md                      # Development plan for async features
├── requirements.txt                 # Python dependencies
└── pyproject.toml                   # Project configuration
```
### Core Components

- **src/servers/whisper_server.py** - MCP server entry point
  - Uses FastMCP framework to expose MCP tools
  - Main tools: `get_model_info_api()`, `transcribe_async()`, `transcribe_upload()`, `check_job_status()`, `get_job_result()`
  - Global job queue and health monitor instances
  - Server initialization around line 31

- **src/servers/api_server.py** - REST API server entry point
  - Uses FastAPI framework for HTTP endpoints
  - Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
  - Shares core transcription logic with MCP server
  - File upload support via multipart/form-data

- **src/core/transcriber.py** - Core transcription logic (shared by both servers)
  - `transcribe_audio()`:39 - Single file transcription with environment variable support
  - `batch_transcribe()`:209 - Batch processing with progress reporting
  - All parameters support environment variable defaults (lines 21-37)
  - Delegates output formatting to utils.formatters

- **src/core/model_manager.py** - Whisper model lifecycle management
  - `get_whisper_model()`:44 - Returns cached model instances or loads new ones
  - `test_gpu_driver()`:20 - GPU validation before model loading
  - CRITICAL: GPU-only mode enforced at lines 64-90 (no CPU fallback)
  - Global `model_instances` dict caches loaded models to prevent reloading
  - Automatic batch size optimization based on GPU memory (lines 134-147)
- **src/core/job_queue.py** - Async job queue manager
  - `JobQueue` class manages FIFO queue with background worker thread
  - `submit_job()` - Validates audio, checks GPU health, adds to queue
  - `get_job_status()` - Returns current job status and queue position
  - `get_job_result()` - Returns transcription result for completed jobs
  - Jobs persist to disk as JSON files for crash recovery
  - Single worker thread processes jobs sequentially (prevents GPU contention)

- **src/core/gpu_health.py** - GPU health monitoring
  - `check_gpu_health()`:39 - Real GPU test using tiny model + test audio
  - `GPUHealthStatus` dataclass contains detailed GPU metrics
  - CRITICAL: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
  - Prevents silent CPU fallback that would cause 10-100x slowdown
  - `HealthMonitor` class for periodic background monitoring

- **src/utils/audio_processor.py** - Audio file validation and preprocessing
  - `validate_audio_file()`:15 - Checks file existence, format, and size
  - `process_audio()`:50 - Decodes audio using faster_whisper's decode_audio

- **src/utils/formatters.py** - Output format conversion
  - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - Convert segments to various formats
  - All formatters accept segment lists from Whisper output

- **src/utils/test_audio_generator.py** - Test audio generation
  - `generate_test_audio()` - Creates synthetic 1-second audio for GPU health checks
  - Uses numpy to generate sine wave, no external audio files needed
- **src/core/gpu_reset.py** - GPU driver reset with cooldown protection
  - `reset_gpu_driver()` - Executes reset_gpu.sh script to reload NVIDIA drivers
  - `check_reset_cooldown()` - Validates if enough time has passed since last reset
  - Cooldown timestamp persists in `/tmp/whisper-gpu-last-reset`
  - Prevents reset loops while allowing recovery from sleep/wake issues

- **src/utils/startup.py** - Startup sequence orchestration
  - `startup_sequence()` - Coordinates GPU health check, queue initialization
  - `cleanup_on_shutdown()` - Cleanup handler for graceful shutdown
  - Centralizes startup logic shared by both servers

- **src/utils/circuit_breaker.py** - Circuit breaker pattern implementation
  - Provides fault tolerance for external service calls
  - Prevents cascading failures (see the sketch after this list)

- **src/utils/input_validation.py** - Input validation utilities
  - Validates and sanitizes user inputs
  - Security layer for API endpoints
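The circuit breaker entry above names a pattern rather than documenting its API. As a rough illustration of the pattern only (not the actual `src/utils/circuit_breaker.py` code, whose class, method, and parameter names may differ):

```python
import time

# Minimal circuit breaker sketch: after `failure_threshold` consecutive failures
# the circuit "opens" and calls fail fast until `reset_timeout` seconds pass,
# at which point a single trial call is allowed through ("half-open").
class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0  # success closes the circuit again
        return result
```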
### Key Architecture Patterns

- **Dual Server Architecture**: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: Models are cached in the `model_instances` dictionary with key format `{model_name}_{device}_{compute_type}` (src/core/model_manager.py:104). This cache is shared if both servers run in the same process (see the sketch after this list)
- **Batch Processing**: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
- **Environment Variable Configuration**: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
- **GPU-Only Mode**: Service is configured for GPU-only operation. `device="auto"` requires CUDA, `device="cpu"` is rejected (src/core/model_manager.py:64-90)
- **Async Job Queue**: Long-running transcriptions use async queue pattern to prevent HTTP timeouts. Jobs return immediately with job_id for polling
- **GPU Health Monitoring**: Real GPU tests with tiny model prevent silent CPU fallback. Jobs are rejected immediately if GPU fails rather than running 10-100x slower on CPU
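A minimal sketch of the caching pattern, assuming the key format documented above; the real `get_whisper_model()` in src/core/model_manager.py adds GPU driver checks and batch-pipeline setup that are omitted here:

```python
# Illustrative only: cache Whisper models under "{model_name}_{device}_{compute_type}".
from faster_whisper import WhisperModel

model_instances = {}

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"
    if key not in model_instances:
        # Loading a model is expensive; cache the instance for reuse across requests.
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]
```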
## Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

**API Server Specific:**
- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

**Job Queue Configuration (if using async features):**
- `JOB_QUEUE_MAX_SIZE` - Maximum queue size (default: 100)
- `JOB_METADATA_DIR` - Directory for job metadata JSON files
- `JOB_RETENTION_DAYS` - Auto-cleanup old jobs (0=disabled)

**GPU Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED` - Enable periodic GPU monitoring (true/false)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - Monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - Model for health checks (default: tiny)

**GPU Auto-Reset Configuration:**
- `GPU_RESET_COOLDOWN_MINUTES` - Minimum time between GPU reset attempts (default: 5 minutes)
- Prevents reset loops while allowing recovery from sleep/wake cycles
- Auto-reset is enabled by default
- Service terminates if GPU unavailable after reset attempt

**Transcription Configuration (shared by both servers):**
- `CUDA_VISIBLE_DEVICES` - GPU device selection
- `WHISPER_MODEL_DIR` - Model storage location (defaults to None for HuggingFace cache)
- `TRANSCRIPTION_OUTPUT_DIR` - Default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - Default output directory for batch processing
- `TRANSCRIPTION_MODEL` - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- `TRANSCRIPTION_DEVICE` - Execution device (cuda, auto) - NOTE: cpu is rejected in GPU-only mode
- `TRANSCRIPTION_COMPUTE_TYPE` - Computation type (float16, int8, auto)
- `TRANSCRIPTION_OUTPUT_FORMAT` - Output format (vtt, srt, txt, json)
- `TRANSCRIPTION_BEAM_SIZE` - Beam search size (default: 5)
- `TRANSCRIPTION_TEMPERATURE` - Sampling temperature (default: 0.0)
- `TRANSCRIPTION_USE_TIMESTAMP` - Add timestamp to filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - Prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - Suffix for output filenames
- `TRANSCRIPTION_LANGUAGE` - Language code (zh, en, ja, etc., auto-detect if not set)
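These transcription variables follow the usual `os.getenv` default pattern; a sketch is below (variable names come from the list above, the authoritative defaults live in src/core/transcriber.py:21-37, and the fallback values shown here are illustrative):

```python
import os

# Illustrative defaults only - see src/core/transcriber.py for the real values.
MODEL_NAME = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")           # "cpu" is rejected in GPU-only mode
OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
LANGUAGE = os.getenv("TRANSCRIPTION_LANGUAGE") or None       # auto-detect when unset
```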
## Supported Configurations
- Models: tiny, base, small, medium, large-v1, large-v2, large-v3
- Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
- Output formats: vtt, srt, json, txt
- Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)
## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /

Returns API information and available endpoints.

### GET /health

Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

### POST /transcribe

Transcribe a single audio file that exists on the server.

**Request Body:**

```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**

```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```
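The same request can be issued from Python with the `requests` library (illustrative client code, not part of this repository; host and port follow the defaults above):

```python
import requests

# Synchronous transcription of a file that already exists on the server.
resp = requests.post(
    "http://localhost:8000/transcribe",
    json={
        "audio_path": "/path/to/audio.mp3",
        "model_name": "large-v3",
        "output_format": "txt",
    },
    timeout=3600,  # long audio can take a while on the synchronous endpoint
)
resp.raise_for_status()
print(resp.json()["output_path"])
```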
### POST /batch-transcribe

Batch transcribe all audio files in a folder.

**Request Body:**

```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**

```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**
- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.
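A hedged Python counterpart for the upload endpoint (illustrative; the form field names follow the list above, and the response body is the transcription file itself):

```python
import requests

# Upload a local audio file and save the returned transcription to disk.
with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/upload-transcribe",
        files={"file": f},
        data={"model_name": "large-v3", "output_format": "txt"},
        timeout=3600,
    )
resp.raise_for_status()
with open("audio.txt", "wb") as out:
    out.write(resp.content)  # response body is the transcription file
```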
## API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
## GPU Auto-Reset Configuration

### Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is enabled by default and includes cooldown protection to prevent reset loops.

### How It Works

1. **Startup Check**: When the service starts, it performs a GPU health check
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → service terminates

2. **Runtime Check**: Before job submission and model loading
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → job rejected, service continues

3. **Cooldown Protection**: Prevents reset loops (sketched below)
   - Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
   - Cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
   - If reset needed but cooldown active → service/job fails immediately
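The cooldown check boils down to comparing a persisted timestamp against the configured window. A sketch of that logic (illustrative only; `src/core/gpu_reset.py` is the real implementation, and its marker-file format and function names may differ):

```python
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_SECONDS = float(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5")) * 60

def reset_allowed() -> bool:
    """True if no reset has been recorded within the cooldown window."""
    try:
        with open(COOLDOWN_FILE) as f:
            last_reset = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True  # no prior reset recorded (or unreadable marker)
    return time.time() - last_reset >= COOLDOWN_SECONDS

def record_reset() -> None:
    """Persist the reset time so the cooldown survives service restarts."""
    with open(COOLDOWN_FILE, "w") as f:
        f.write(str(time.time()))
```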
### Manual GPU Reset

You can manually reset the GPU anytime:

```bash
./reset_gpu.sh
```

Or clear the cooldown to allow immediate reset:

```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```
### Behavior Examples

**After sleep/wake with GPU issue:**

```
Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues
```

**Multiple failures (hardware issue):**

```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```

**Normal operation:**

```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```
## Important Implementation Details

### GPU-Only Architecture

- CRITICAL: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
- `device="auto"` requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
- GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
- If GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- GPU Auto-Reset: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)
### Model Management

- GPU memory is checked before loading models (src/core/model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise) - see the sketch below
- Models are cached globally in the `model_instances` dict, shared across requests
- Model loading includes GPU driver test to fail fast if GPU is unavailable (src/core/model_manager.py:112-114)
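The batch-size tiers above map to a simple threshold function; a sketch of just the tiering (the real selection happens in src/core/model_manager.py:132-160, and whether it uses free or total GPU memory is not specified here):

```python
def pick_batch_size(gpu_memory_gb: float) -> int:
    """Map GPU memory in GB to a batch size, following the documented tiers."""
    if gpu_memory_gb > 16:
        return 32
    if gpu_memory_gb > 12:
        return 16
    if gpu_memory_gb > 8:
        return 8
    if gpu_memory_gb > 4:
        return 4
    return 2
```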
### Transcription Settings
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
- Word timestamps are enabled by default (src/core/transcriber.py:107)
- Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
- Default output format is "txt" for REST API, configured via environment variables for MCP server
### Async Job Queue (if enabled)
- Single worker thread processes jobs sequentially (prevents GPU memory contention)
- Jobs persist to disk as JSON files in JOB_METADATA_DIR
- Queue has max size limit (default 100), returns 503 when full
- Job status polling recommended every 5-10 seconds for LLM agents
## Development Workflow

### Running Tests

The test suite requires GPU access. Ensure CUDA is properly configured before running tests.

```bash
# Set PYTHONPATH to include src directory
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"

# Run core component tests (GPU health, job queue, audio validation)
python tests/test_core_components.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py

# Run async API integration tests
python tests/test_async_api_integration.py
```
Tests will automatically:
- Check for GPU availability (exit if not available)
- Validate audio file processing
- Test GPU health monitoring
- Test job queue operations
- Test transcription pipeline
### Testing GPU Health

```python
# Test GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```
### Testing Job Queue

```python
# Test job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get result when completed
result = queue.get_job_result(job_info['job_id'])
```
### Common Debugging

**Model loading issues:**
- Check `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check logs for GPU driver test failures at model_manager.py:112-114

**GPU not detected:**
- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check `torch.cuda.is_available()` returns True
- Run GPU health check to see detailed error

**Silent failures:**
- Check that service is NOT silently falling back to CPU
- GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU

**Job queue issues:**
- Check `JOB_METADATA_DIR` exists and is writable
- Verify background worker thread is running (check logs)
- Job metadata files are in `{JOB_METADATA_DIR}/{job_id}.json`
## File Locations

- Logs: `mcp.logs` (MCP server), `api.logs` (API server)
- Models: `$WHISPER_MODEL_DIR` or HuggingFace cache
- Outputs: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- Job Metadata: `$JOB_METADATA_DIR/{job_id}.json`
## Important Development Notes

- See `DEV_PLAN.md` for detailed architecture and implementation plan for async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time >30 seconds