CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service can run as either:

  1. MCP Server - For integration with Claude Desktop and other MCP clients
  2. REST API Server - For HTTP-based integrations

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

Development Commands

Environment Setup

# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
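
To confirm that the installed PyTorch build actually sees the GPU, a quick optional check can be run from the same virtual environment:

# Optional: verify the PyTorch install and CUDA visibility
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))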

Running the Servers

MCP Server (for Claude Desktop)

# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution
python whisper_server.py

# Using MCP CLI for development testing
mcp dev whisper_server.py

# Run server with MCP CLI
mcp run whisper_server.py

REST API Server (for HTTP clients)

# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution (starts uvicorn internally)
python api_server.py

# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000

Running Both Simultaneously

# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh

Docker

# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server

Architecture

Core Components

  1. whisper_server.py - MCP server entry point

    • Uses FastMCP framework to expose three MCP tools
    • Delegates to transcriber.py for actual processing
    • Server initialization at line 19
  2. api_server.py - REST API server entry point

    • Uses FastAPI framework to expose HTTP endpoints
    • Provides six REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
    • Shares the same core transcription logic with the MCP server
    • Includes file upload support via multipart/form-data
  3. transcriber.py - Core transcription logic (shared by both servers)

    • transcribe_audio() (line 38) - Single file transcription with environment variable support
    • batch_transcribe() (line 208) - Batch processing with progress reporting
    • All parameters support environment variable defaults
    • Handles output formatting delegation to formatters.py
  4. model_manager.py - Whisper model lifecycle management

    • get_whisper_model() (line 44) - Returns cached model instances or loads new ones
    • test_gpu_driver() (line 20) - GPU validation before model loading
    • Global model_instances dict caches loaded models to prevent reloading
    • Automatically determines batch size based on available GPU memory (lines 113-134)
  5. audio_processor.py - Audio file validation and preprocessing

    • validate_audio_file() (line 15) - Checks file existence, format, and size
    • process_audio() (line 50) - Decodes audio using faster_whisper's decode_audio
  6. formatters.py - Output format conversion

    • format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats
    • All formatters accept segment lists from Whisper output
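
To make the delegation concrete, here is a condensed, hypothetical sketch of how whisper_server.py can wire an MCP tool to the shared transcriber using FastMCP. The tool name and the exact transcribe_audio() signature are assumptions based on the descriptions above, not the repository's literal code:

# Condensed sketch only; tool names and signatures in whisper_server.py may differ
from mcp.server.fastmcp import FastMCP
from transcriber import transcribe_audio  # shared core logic

mcp = FastMCP("whisper-transcription")

@mcp.tool()
def transcribe(audio_path: str, model_name: str = "large-v3",
               output_format: str = "txt") -> str:
    """Transcribe one audio file and return the result message."""
    return transcribe_audio(audio_path=audio_path,
                            model_name=model_name,
                            output_format=output_format)

if __name__ == "__main__":
    mcp.run()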

Key Architecture Patterns

  • Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (transcriber.py, model_manager.py, audio_processor.py, formatters.py), ensuring consistent behavior
  • Model Caching: Models are cached in the model_instances dictionary under keys of the form {model_name}_{device}_{compute_type} (model_manager.py:84); the cache is shared if both servers run in the same process (see the sketch after this list)
  • Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (model_manager.py:109-134)
  • Environment Variable Configuration: All transcription parameters support env var defaults (transcriber.py:19-36)
  • Device Auto-Detection: device="auto" automatically selects CUDA if available, otherwise CPU (model_manager.py:64-66)
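
The caching pattern reduces to a keyed dictionary lookup. A minimal sketch, simplified from the behavior described above (the real get_whisper_model() also runs the GPU driver test and sets up batched inference):

# Simplified caching sketch; see model_manager.py for the full logic
from faster_whisper import WhisperModel

model_instances = {}  # global cache, shared when both servers run in one process

def get_whisper_model(model_name: str, device: str, compute_type: str):
    key = f"{model_name}_{device}_{compute_type}"
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device,
                                            compute_type=compute_type)
    return model_instances[key]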

Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

API Server Specific:

  • API_HOST - API server host (default: 0.0.0.0)
  • API_PORT - API server port (default: 8000)

Transcription Configuration (shared by both servers):

  • CUDA_VISIBLE_DEVICES - GPU device selection
  • WHISPER_MODEL_DIR - Model storage location (unset by default, in which case the Hugging Face cache is used)
  • TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
  • TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
  • TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
  • TRANSCRIPTION_DEVICE - Execution device (cpu, cuda, auto)
  • TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
  • TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
  • TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
  • TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
  • TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
  • TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
  • TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
  • TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc., auto-detect if not set)
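
Inside transcriber.py these variables act as fallbacks that explicit call arguments override. A rough illustration of the resolution (variable names match the list above; the default values shown are assumptions taken from this document):

# Illustrative only; see transcriber.py:19-36 for the actual defaults
import os

defaults = {
    "model_name":    os.getenv("TRANSCRIPTION_MODEL", "large-v3"),
    "device":        os.getenv("TRANSCRIPTION_DEVICE", "auto"),
    "compute_type":  os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto"),
    "output_format": os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt"),
    "beam_size":     int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5")),
    "temperature":   float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0")),
    "language":      os.getenv("TRANSCRIPTION_LANGUAGE") or None,  # None => auto-detect
}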

Supported Configurations

  • Models: tiny, base, small, medium, large-v1, large-v2, large-v3
  • Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
  • Output formats: vtt, srt, json, txt
  • Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

REST API Endpoints

The REST API server provides the following HTTP endpoints:

GET /

Returns API information and available endpoints.

GET /health

Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.

GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

POST /transcribe

Transcribe a single audio file that exists on the server.

Request Body:

{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}

Response:

{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
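
The same request can be issued from Python with the requests library (a sketch; it assumes the server is reachable on localhost:8000):

# Sketch using the requests library
import requests

resp = requests.post(
    "http://localhost:8000/transcribe",
    json={"audio_path": "/path/to/audio.mp3", "output_format": "txt"},
    timeout=600,  # long audio files can take several minutes
)
resp.raise_for_status()
print(resp.json()["output_path"])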

POST /batch-transcribe

Batch transcribe all audio files in a folder.

Request Body:

{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}

Response:

{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}

POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

Form Data:

  • file: Audio file (multipart/form-data)
  • model_name: Model name (default: "large-v3")
  • device: Device (default: "auto")
  • output_format: Output format (default: "txt")
  • ... (other transcription parameters)

Response: Returns the transcription file for download.
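
Because the endpoint streams the transcription file back, a client should save the response body to disk. A Python sketch (field names follow the form data above; the output filename is arbitrary):

# Sketch using the requests library; saves the returned transcription file
import requests

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/upload-transcribe",
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"model_name": "large-v3", "output_format": "txt"},
        timeout=600,
    )
resp.raise_for_status()
with open("audio.txt", "wb") as out:
    out.write(resp.content)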

API Usage Examples

# Get model information
curl http://localhost:8000/models

# Transcribe existing file
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

Important Implementation Details

  • GPU memory is checked before loading models (model_manager.py:115-127)
  • Batch size adjusts dynamically with available GPU memory: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 otherwise (see the sketch below)
  • VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (transcriber.py:101)
  • Word timestamps are enabled by default (transcriber.py:106)
  • Model loading includes GPU driver test to fail fast if GPU is unavailable (model_manager.py:92)
  • Files over 1GB generate warnings about processing time (audio_processor.py:42)
  • The default output format is "txt" for the REST API; for the MCP server it is configured via environment variables
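
The batch-size rule is a simple threshold ladder over GPU memory. A sketch under the thresholds listed above (the real check lives in model_manager.py:113-134 and may query memory differently; the CPU fallback here is an assumption):

# Sketch of the batch-size ladder described above
import torch

def pick_batch_size(device: str) -> int:
    if device != "cuda" or not torch.cuda.is_available():
        return 2  # assumption: conservative fallback when CUDA is not used
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    mem_gb = free_bytes / (1024 ** 3)
    if mem_gb > 16:
        return 32
    if mem_gb > 12:
        return 16
    if mem_gb > 8:
        return 8
    if mem_gb > 4:
        return 4
    return 2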