CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:
- MCP Server - For integration with Claude Desktop and other MCP clients
- REST API Server - For HTTP-based integrations with async job queue support
Both servers share the same core transcription logic and can run independently or simultaneously on different ports.
Key Features:
- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- Automatic GPU driver reset on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory
Development Commands
Environment Setup
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
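After installing, a quick sanity check confirms PyTorch can see the GPU (this snippet is illustrative and not part of the repo; the service's own health checks in src/core/gpu_health.py go further and run real inference):
```python
import torch

print(torch.__version__)          # e.g. 2.6.0+cu126 for the CUDA 12.6 build
print(torch.cuda.is_available())  # should be True for GPU builds
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```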
Running the Servers
MCP Server (for Claude Desktop)
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh
# Direct Python execution
python whisper_server.py
# Using MCP CLI for development testing
mcp dev whisper_server.py
# Run server with MCP CLI
mcp run whisper_server.py
REST API Server (for HTTP clients)
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh
# Direct Python execution with uvicorn
python api_server.py
# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000
# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
Running Both Simultaneously
# Terminal 1: Start MCP server
./run_mcp_server.sh
# Terminal 2: Start REST API server
./run_api_server.sh
Docker
# Build Docker image
docker build -t whisper-mcp-server .
# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
Architecture
Directory Structure
.
├── src/ # Source code directory
│ ├── servers/ # Server implementations
│ │ ├── whisper_server.py # MCP server entry point
│ │ └── api_server.py # REST API server (async job queue)
│ ├── core/ # Core business logic
│ │ ├── transcriber.py # Transcription logic (single & batch)
│ │ ├── model_manager.py # Model lifecycle & caching
│ │ ├── job_queue.py # Async job queue manager
│ │ └── gpu_health.py # GPU health monitoring
│ └── utils/ # Utility modules
│ ├── audio_processor.py # Audio validation & preprocessing
│ ├── formatters.py # Output format conversion
│ └── test_audio_generator.py # Test audio generation for GPU checks
├── run_mcp_server.sh # MCP server startup script
├── run_api_server.sh # API server startup script
├── reset_gpu.sh # GPU driver reset script
├── DEV_PLAN.md # Development plan for async features
├── requirements.txt # Python dependencies
└── pyproject.toml # Project configuration
Core Components
- src/servers/whisper_server.py - MCP server entry point
  - Uses FastMCP framework to expose MCP tools
  - Three main tools: get_model_info_api(), transcribe(), batch_transcribe_audio()
  - Server initialization at line 19
- src/servers/api_server.py - REST API server entry point
  - Uses FastAPI framework for HTTP endpoints
  - Provides REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
  - Shares core transcription logic with MCP server
  - File upload support via multipart/form-data
- src/core/transcriber.py - Core transcription logic (shared by both servers)
  - transcribe_audio():39 - Single file transcription with environment variable support
  - batch_transcribe():209 - Batch processing with progress reporting
  - All parameters support environment variable defaults (lines 21-37)
  - Delegates output formatting to utils.formatters
- src/core/model_manager.py - Whisper model lifecycle management
  - get_whisper_model():44 - Returns cached model instances or loads new ones
  - test_gpu_driver():20 - GPU validation before model loading
  - CRITICAL: GPU-only mode enforced at lines 64-90 (no CPU fallback)
  - Global model_instances dict caches loaded models to prevent reloading
  - Automatic batch size optimization based on GPU memory (lines 134-147)
- src/core/job_queue.py - Async job queue manager
  - JobQueue class manages a FIFO queue with a background worker thread
  - submit_job() - Validates audio, checks GPU health, adds to queue
  - get_job_status() - Returns current job status and queue position
  - get_job_result() - Returns transcription result for completed jobs
  - Jobs persist to disk as JSON files for crash recovery
  - Single worker thread processes jobs sequentially (prevents GPU contention)
- src/core/gpu_health.py - GPU health monitoring
  - check_gpu_health():39 - Real GPU test using tiny model + test audio
  - GPUHealthStatus dataclass contains detailed GPU metrics
  - CRITICAL: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
  - Prevents silent CPU fallback that would cause 10-100x slowdown
  - HealthMonitor class for periodic background monitoring
- src/utils/audio_processor.py - Audio file validation and preprocessing
  - validate_audio_file():15 - Checks file existence, format, and size
  - process_audio():50 - Decodes audio using faster_whisper's decode_audio
- src/utils/formatters.py - Output format conversion
  - format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats
  - All formatters accept segment lists from Whisper output
- src/utils/test_audio_generator.py - Test audio generation
  - generate_test_audio() - Creates synthetic 1-second audio for GPU health checks
  - Uses numpy to generate a sine wave; no external audio files needed
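A minimal sketch of that sine-wave approach (the actual signature and parameters in src/utils/test_audio_generator.py may differ):
```python
import numpy as np

def generate_test_audio(duration_s: float = 1.0, sample_rate: int = 16000) -> np.ndarray:
    """Synthesize a short sine wave as float32 mono audio for GPU health checks."""
    t = np.linspace(0.0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return (0.5 * np.sin(2.0 * np.pi * 440.0 * t)).astype(np.float32)

audio = generate_test_audio()  # 1 second of a 440 Hz tone at 16 kHz
```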
Key Architecture Patterns
- Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- Model Caching: Models are cached in the model_instances dictionary with key format {model_name}_{device}_{compute_type} (src/core/model_manager.py:104). This cache is shared if both servers run in the same process (see the sketch after this list)
- Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
- Environment Variable Configuration: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
- GPU-Only Mode: Service is configured for GPU-only operation. device="auto" requires CUDA; device="cpu" is rejected (src/core/model_manager.py:64-90)
- Async Job Queue: Long-running transcriptions use an async queue pattern to prevent HTTP timeouts. Jobs return immediately with a job_id for polling
- GPU Health Monitoring: Real GPU tests with tiny model prevent silent CPU fallback. Jobs are rejected immediately if GPU fails rather than running 10-100x slower on CPU
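The caching pattern can be illustrated with a short sketch; the key format comes from the documentation above, while the surrounding code is illustrative rather than the actual model_manager.py implementation:
```python
# Loaded models keyed by model, device, and compute type, as documented above.
model_instances: dict = {}

def cache_key(model_name: str, device: str, compute_type: str) -> str:
    return f"{model_name}_{device}_{compute_type}"

# cache_key("large-v3", "cuda", "float16") -> "large-v3_cuda_float16"
```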
Environment Variables
All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:
API Server Specific:
- API_HOST - API server host (default: 0.0.0.0)
- API_PORT - API server port (default: 8000)
Job Queue Configuration (if using async features):
- JOB_QUEUE_MAX_SIZE - Maximum queue size (default: 100)
- JOB_METADATA_DIR - Directory for job metadata JSON files
- JOB_RETENTION_DAYS - Auto-cleanup old jobs (0 = disabled)
GPU Health Monitoring:
- GPU_HEALTH_CHECK_ENABLED - Enable periodic GPU monitoring (true/false)
- GPU_HEALTH_CHECK_INTERVAL_MINUTES - Monitoring interval (default: 10)
- GPU_HEALTH_TEST_MODEL - Model for health checks (default: tiny)
GPU Auto-Reset Configuration:
- GPU_RESET_COOLDOWN_MINUTES - Minimum time between GPU reset attempts (default: 5 minutes)
- Prevents reset loops while allowing recovery from sleep/wake cycles
- Auto-reset is enabled by default
- Service terminates if GPU unavailable after reset attempt
Transcription Configuration (shared by both servers):
- CUDA_VISIBLE_DEVICES - GPU device selection
- WHISPER_MODEL_DIR - Model storage location (defaults to None for HuggingFace cache)
- TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
- TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
- TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- TRANSCRIPTION_DEVICE - Execution device (cuda, auto) - NOTE: cpu is rejected in GPU-only mode
- TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
- TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
- TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
- TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
- TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
- TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
- TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
- TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc.; auto-detect if not set)
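All of these follow the same read-with-default pattern in src/core/transcriber.py; a minimal illustration (variable names match the list above, but the exact parsing may differ):
```python
import os

beam_size = int(os.environ.get("TRANSCRIPTION_BEAM_SIZE", "5"))
temperature = float(os.environ.get("TRANSCRIPTION_TEMPERATURE", "0.0"))
output_format = os.environ.get("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
language = os.environ.get("TRANSCRIPTION_LANGUAGE")  # None means auto-detect
```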
Supported Configurations
- Models: tiny, base, small, medium, large-v1, large-v2, large-v3
- Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
- Output formats: vtt, srt, json, txt
- Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)
REST API Endpoints
The REST API server provides the following HTTP endpoints:
GET /
Returns API information and available endpoints.
GET /health
Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.
GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).
POST /transcribe
Transcribe a single audio file that exists on the server.
Request Body:
{
"audio_path": "/path/to/audio.mp3",
"model_name": "large-v3",
"device": "auto",
"compute_type": "auto",
"language": "en",
"output_format": "txt",
"beam_size": 5,
"temperature": 0.0,
"initial_prompt": null,
"output_directory": null
}
Response:
{
"success": true,
"message": "Transcription successful, results saved to: /path/to/output.txt",
"output_path": "/path/to/output.txt"
}
POST /batch-transcribe
Batch transcribe all audio files in a folder.
Request Body:
{
"audio_folder": "/path/to/audio/folder",
"output_folder": "/path/to/output",
"model_name": "large-v3",
"output_format": "txt",
...
}
Response:
{
"success": true,
"summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.
Form Data:
- file: Audio file (multipart/form-data)
- model_name: Model name (default: "large-v3")
- device: Device (default: "auto")
- output_format: Output format (default: "txt")
- ... (other transcription parameters)
Response: Returns the transcription file for download.
API Usage Examples
# Get model information
curl http://localhost:8000/models
# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'
# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
-F "file=@audio.mp3" \
-F "output_format=txt" \
-F "model_name=large-v3"
# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}
# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}
# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text
# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
GPU Auto-Reset Configuration
Overview
This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is enabled by default and includes cooldown protection to prevent reset loops.
Passwordless Sudo Setup (Required)
For automatic GPU reset to work, you must configure passwordless sudo for NVIDIA commands. Create a sudoers configuration file:
sudo visudo -f /etc/sudoers.d/whisper-gpu-reset
Add the following (replace your_username with your actual username):
# Whisper GPU Auto-Reset Permissions
your_username ALL=(ALL) NOPASSWD: /bin/systemctl stop nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /bin/systemctl start nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_drm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_drm
Security Note: These permissions are limited to specific NVIDIA driver commands only. The reset script (reset_gpu.sh) is executed with sudo but is part of the codebase and can be audited.
How It Works
1. Startup Check: When the service starts, it performs a GPU health check
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → service terminates
2. Runtime Check: Before job submission and model loading
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → job rejected, service continues
3. Cooldown Protection: Prevents reset loops (see the sketch after this list)
   - Minimum 5 minutes between reset attempts (configurable via GPU_RESET_COOLDOWN_MINUTES)
   - Cooldown persists across restarts (stored in /tmp/whisper-gpu-last-reset)
   - If reset needed but cooldown active → service/job fails immediately
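A minimal sketch of the cooldown check; the file path and default come from the documentation above, and src/core/gpu_reset.py is the authoritative implementation:
```python
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"  # path from the docs above

def cooldown_active() -> bool:
    """True if a reset happened within the cooldown window.

    This sketch uses the marker file's mtime; the real code may store
    and read the timestamp differently.
    """
    minutes = float(os.environ.get("GPU_RESET_COOLDOWN_MINUTES", "5"))
    try:
        last_reset = os.path.getmtime(COOLDOWN_FILE)
    except FileNotFoundError:
        return False  # no previous reset recorded
    return (time.time() - last_reset) < minutes * 60
```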
Manual GPU Reset
You can manually reset the GPU anytime:
./reset_gpu.sh
Or clear the cooldown to allow immediate reset:
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
Behavior Examples
After sleep/wake with GPU issue:
Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues
Multiple failures (hardware issue):
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
Normal operation:
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
Important Implementation Details
GPU-Only Architecture
- CRITICAL: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
- device="auto" requires CUDA to be available and raises RuntimeError if not (src/core/model_manager.py:64-73)
- GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
- If GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- GPU Auto-Reset: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)
Model Management
- GPU memory is checked before loading models (src/core/model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise); see the sketch after this list
- Models are cached globally in the model_instances dict, shared across requests
- Model loading includes a GPU driver test to fail fast if the GPU is unavailable (src/core/model_manager.py:112-114)
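The batch-size tiering can be written as a small selection function (a sketch of the logic documented above; the real code lives at src/core/model_manager.py:134-147):
```python
def pick_batch_size(gpu_mem_gb: float) -> int:
    """Map available GPU memory to a batch size, per the tiers above."""
    if gpu_mem_gb > 16:
        return 32
    if gpu_mem_gb > 12:
        return 16
    if gpu_mem_gb > 8:
        return 8
    if gpu_mem_gb > 4:
        return 4
    return 2
```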
Transcription Settings
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102); see the sketch after this list
- Word timestamps are enabled by default (src/core/transcriber.py:107)
- Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
- Default output format is "txt" for the REST API; the MCP server's default is configured via environment variables
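These defaults map onto faster_whisper's transcribe options roughly as follows (a sketch; the actual call is in src/core/transcriber.py):
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "/path/to/audio.mp3",
    vad_filter=True,       # VAD default (src/core/transcriber.py:102)
    word_timestamps=True,  # word timestamps default (src/core/transcriber.py:107)
    beam_size=5,
    temperature=0.0,
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```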
Async Job Queue (if enabled)
- Single worker thread processes jobs sequentially (prevents GPU memory contention)
- Jobs persist to disk as JSON files in JOB_METADATA_DIR
- Queue has max size limit (default 100), returns 503 when full
- Job status polling recommended every 5-10 seconds for LLM agents
Development Workflow
Testing GPU Health
# Test GPU health check manually
from src.core.gpu_health import check_gpu_health
status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
Testing Job Queue
# Test job queue manually
from src.core.job_queue import JobQueue
queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()
# Submit job
job_info = queue.submit_job(
audio_path="/path/to/test.mp3",
model_name="large-v3",
device="cuda"
)
print(f"Job ID: {job_info['job_id']}")
# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")
# Get result when completed
result = queue.get_job_result(job_info['job_id'])
Common Debugging
Model loading issues:
- Check that WHISPER_MODEL_DIR is set correctly
- Verify GPU memory with nvidia-smi
- Check logs for GPU driver test failures at model_manager.py:112-114
GPU not detected:
- Verify that CUDA_VISIBLE_DEVICES is set correctly
- Check that torch.cuda.is_available() returns True
- Run the GPU health check to see the detailed error
Silent failures:
- Check that service is NOT silently falling back to CPU
- GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU
Job queue issues:
- Check that JOB_METADATA_DIR exists and is writable
- Verify the background worker thread is running (check logs)
- Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json
File Locations
- Logs: mcp.logs (MCP server), api.logs (API server)
- Models: $WHISPER_MODEL_DIR or HuggingFace cache
- Outputs: $TRANSCRIPTION_OUTPUT_DIR or $TRANSCRIPTION_BATCH_OUTPUT_DIR
- Job Metadata: $JOB_METADATA_DIR/{job_id}.json
Important Development Notes
- See DEV_PLAN.md for the detailed architecture and implementation plan for async job queue features
- The service is designed for GPU-only operation; CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time >30 seconds
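As a reference for that pattern, a new long-running endpoint might look like the following sketch. The endpoint name is hypothetical; JobQueue usage mirrors the testing example above, and the exception type handled here is an assumption:
```python
from fastapi import FastAPI, HTTPException
from src.core.job_queue import JobQueue

app = FastAPI()
queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

@app.post("/my-long-task")  # hypothetical endpoint name
def my_long_task(audio_path: str):
    # Return a job_id immediately rather than holding the HTTP connection
    # open for the full transcription (avoids client timeouts).
    try:
        job_info = queue.submit_job(audio_path=audio_path)
    except RuntimeError as exc:  # e.g. queue full or GPU health check failed
        raise HTTPException(status_code=503, detail=str(exc))
    return {"job_id": job_info["job_id"], "status": "queued"}
```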