CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service can run as either:
- MCP Server - For integration with Claude Desktop and other MCP clients
- REST API Server - For HTTP-based integrations
Both servers share the same core transcription logic and can run independently or simultaneously on different ports.
Development Commands
Environment Setup
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
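# Optional sanity check after installation (a minimal sketch) to confirm PyTorch
# sees the GPU before loading any Whisper models:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"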
Running the Servers
MCP Server (for Claude Desktop)
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh
# Direct Python execution
python whisper_server.py
# Using MCP CLI for development testing
mcp dev whisper_server.py
# Run server with MCP CLI
mcp run whisper_server.py
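For orientation, the general shape of a FastMCP server such as whisper_server.py looks roughly like the sketch below. The tool name and signature here are illustrative only; the real tools and their parameters live in whisper_server.py.
# Hypothetical sketch of the FastMCP pattern, not the actual whisper_server.py code
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("whisper")  # server name is illustrative

@mcp.tool()
def transcribe(audio_path: str, output_format: str = "txt") -> str:
    """Illustrative stub; the real tool delegates to transcriber.py."""
    ...

if __name__ == "__main__":
    mcp.run()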
REST API Server (for HTTP clients)
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh
# Direct Python execution with uvicorn
python api_server.py
# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000
# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
Running Both Simultaneously
# Terminal 1: Start MCP server
./run_mcp_server.sh
# Terminal 2: Start REST API server
./run_api_server.sh
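Once both are running, a quick way to confirm the REST API is up is to hit the /health endpoint (a minimal sketch using requests; adjust host/port if you changed API_HOST or API_PORT):
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.json())  # expected: {"status": "healthy", "service": "whisper-transcription"}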
Docker
# Build Docker image
docker build -t whisper-mcp-server .
# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
Architecture
Core Components
- whisper_server.py - MCP server entry point
  - Uses FastMCP framework to expose three MCP tools
  - Delegates to transcriber.py for actual processing
  - Server initialization at line 19
- api_server.py - REST API server entry point
  - Uses FastAPI framework to expose HTTP endpoints
  - Provides six REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
  - Shares the same core transcription logic with the MCP server
  - Includes file upload support via multipart/form-data
- transcriber.py - Core transcription logic (shared by both servers)
  - transcribe_audio() (line 38) - Single-file transcription with environment variable support
  - batch_transcribe() (line 208) - Batch processing with progress reporting
  - All parameters support environment variable defaults
  - Delegates output formatting to formatters.py
- model_manager.py - Whisper model lifecycle management
  - get_whisper_model() (line 44) - Returns cached model instances or loads new ones
  - test_gpu_driver() (line 20) - GPU validation before model loading
  - Global model_instances dict caches loaded models to prevent reloading
  - Automatically determines batch size based on available GPU memory (lines 113-134)
- audio_processor.py - Audio file validation and preprocessing
  - validate_audio_file() (line 15) - Checks file existence, format, and size
  - process_audio() (line 50) - Decodes audio using faster_whisper's decode_audio
- formatters.py - Output format conversion
  - format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats (see the sketch after this list)
  - All formatters accept segment lists from Whisper output
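To illustrate the formatter idea, here is a minimal sketch of an SRT-style converter. It is hypothetical, not the actual formatters.py code; segments are assumed to expose .start, .end, and .text, as faster-whisper segments do.
# Hypothetical sketch of a segment-to-SRT formatter
def format_srt(segments) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int((seconds - int(seconds)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)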
Key Architecture Patterns
- Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (transcriber.py, model_manager.py, audio_processor.py, formatters.py), ensuring consistent behavior
- Model Caching: Models are cached in the model_instances dictionary with key format {model_name}_{device}_{compute_type} (model_manager.py:84). The cache is shared only when both servers run in the same process (see the sketch after this list)
- Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (model_manager.py:109-134)
- Environment Variable Configuration: All transcription parameters support env var defaults (transcriber.py:19-36)
- Device Auto-Detection: device="auto" automatically selects CUDA if available, otherwise CPU (model_manager.py:64-66)
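The caching pattern amounts to a keyed get-or-load. The sketch below shows the shape of it, assuming faster_whisper's WhisperModel; the real model_manager.py additionally runs the GPU driver test and batch-size logic described elsewhere in this file.
# Minimal sketch of the model caching pattern (not the actual model_manager.py code)
from faster_whisper import WhisperModel

model_instances = {}  # key -> loaded model, shared within one process

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"  # key format from model_manager.py:84
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]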
Environment Variables
All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:
API Server Specific:
- API_HOST - API server host (default: 0.0.0.0)
- API_PORT - API server port (default: 8000)
Transcription Configuration (shared by both servers):
- CUDA_VISIBLE_DEVICES - GPU device selection
- WHISPER_MODEL_DIR - Model storage location (defaults to None for HuggingFace cache)
- TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
- TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
- TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- TRANSCRIPTION_DEVICE - Execution device (cpu, cuda, auto)
- TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
- TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
- TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
- TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
- TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
- TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
- TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
- TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc.; auto-detect if not set)
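The transcription defaults in transcriber.py are read from these variables. A rough sketch of the pattern (the variable names are the real ones above; the structure and fallback values shown here are illustrative, based on the defaults documented in this file):
# Illustrative sketch of env-var-backed defaults, not the actual transcriber.py code
import os

DEFAULTS = {
    "model_name": os.getenv("TRANSCRIPTION_MODEL", "large-v3"),
    "device": os.getenv("TRANSCRIPTION_DEVICE", "auto"),
    "compute_type": os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto"),
    "output_format": os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt"),
    "beam_size": int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5")),
    "temperature": float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0")),
    "language": os.getenv("TRANSCRIPTION_LANGUAGE") or None,  # empty/unset -> auto-detect
}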
Supported Configurations
- Models: tiny, base, small, medium, large-v1, large-v2, large-v3
- Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
- Output formats: vtt, srt, json, txt
- Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)
REST API Endpoints
The REST API server provides the following HTTP endpoints:
GET /
Returns API information and available endpoints.
GET /health
Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.
GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).
POST /transcribe
Transcribe a single audio file that exists on the server.
Request Body:
{
"audio_path": "/path/to/audio.mp3",
"model_name": "large-v3",
"device": "auto",
"compute_type": "auto",
"language": "en",
"output_format": "txt",
"beam_size": 5,
"temperature": 0.0,
"initial_prompt": null,
"output_directory": null
}
Response:
{
"success": true,
"message": "Transcription successful, results saved to: /path/to/output.txt",
"output_path": "/path/to/output.txt"
}
POST /batch-transcribe
Batch transcribe all audio files in a folder.
Request Body:
{
"audio_folder": "/path/to/audio/folder",
"output_folder": "/path/to/output",
"model_name": "large-v3",
"output_format": "txt",
...
}
Response:
{
"success": true,
"summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.
Form Data:
- file: Audio file (multipart/form-data)
- model_name: Model name (default: "large-v3")
- device: Device (default: "auto")
- output_format: Output format (default: "txt")
- ... (other transcription parameters)
Response: Returns the transcription file for download.
API Usage Examples
# Get model information
curl http://localhost:8000/models
# Transcribe existing file
curl -X POST http://localhost:8000/transcribe \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'
# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
-F "file=@audio.mp3" \
-F "output_format=txt" \
-F "model_name=large-v3"
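The same calls can be made from Python with the requests library (a minimal sketch; file paths are placeholders):
import requests

BASE = "http://localhost:8000"

# Transcribe a file that already exists on the server
r = requests.post(f"{BASE}/transcribe",
                  json={"audio_path": "/path/to/audio.mp3", "output_format": "txt"})
print(r.json())

# Upload a local file and save the returned transcription
with open("audio.mp3", "rb") as f:
    r = requests.post(f"{BASE}/upload-transcribe",
                      files={"file": f},
                      data={"output_format": "txt", "model_name": "large-v3"})
with open("audio.txt", "wb") as out:
    out.write(r.content)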
Important Implementation Details
- GPU memory is checked before loading models (model_manager.py:115-127)
- Batch size adjusts dynamically with available GPU memory: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 otherwise; see the sketch after this list
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (transcriber.py:101)
- Word timestamps are enabled by default (transcriber.py:106)
- Model loading includes GPU driver test to fail fast if GPU is unavailable (model_manager.py:92)
- Files over 1GB generate warnings about processing time (audio_processor.py:42)
- Default output format is "txt" for REST API, configured via environment variables for MCP server
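The sketch below ties these details together: the documented batch-size thresholds and a transcription call with the documented defaults (VAD and word timestamps enabled). It uses the real faster_whisper API, but it is a simplified illustration, not the actual model_manager.py/transcriber.py code.
# Illustrative sketch, assuming faster_whisper's WhisperModel / BatchedInferencePipeline
from faster_whisper import WhisperModel, BatchedInferencePipeline

def pick_batch_size(gpu_mem_gb: float) -> int:
    # Thresholds documented above (model_manager.py:113-134)
    if gpu_mem_gb > 16:
        return 32
    if gpu_mem_gb > 12:
        return 16
    if gpu_mem_gb > 8:
        return 8
    if gpu_mem_gb > 4:
        return 4
    return 2

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)  # used automatically on CUDA devices

segments, info = pipeline.transcribe(
    "audio.mp3",
    batch_size=pick_batch_size(16.0),  # pass the measured GPU memory here
    beam_size=5,                       # TRANSCRIPTION_BEAM_SIZE default
    temperature=0.0,                   # TRANSCRIPTION_TEMPERATURE default
    vad_filter=True,                   # VAD on by default (transcriber.py:101)
    word_timestamps=True,              # word timestamps on by default (transcriber.py:106)
)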