CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service can run as either:
- MCP Server - For integration with Claude Desktop and other MCP clients
- REST API Server - For HTTP-based integrations
Both servers share the same core transcription logic and can run independently or simultaneously on different ports.
Development Commands
Environment Setup
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
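# Optional sanity check after installation (a minimal sketch) to confirm PyTorch
# sees the GPU before loading any Whisper models:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"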
Running the Servers
MCP Server (for Claude Desktop)
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh
# Direct Python execution
python whisper_server.py
# Using MCP CLI for development testing
mcp dev whisper_server.py
# Run server with MCP CLI
mcp run whisper_server.py
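For orientation, the general shape of a FastMCP server such as whisper_server.py looks roughly like the sketch below. The tool name and signature here are illustrative only; the real tools and their parameters live in whisper_server.py.
# Hypothetical sketch of the FastMCP pattern, not the actual whisper_server.py code
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("whisper")  # server name is illustrative

@mcp.tool()
def transcribe(audio_path: str, output_format: str = "txt") -> str:
    """Illustrative stub; the real tool delegates to transcriber.py."""
    ...

if __name__ == "__main__":
    mcp.run()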
REST API Server (for HTTP clients)
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh
# Direct Python execution with uvicorn
python api_server.py
# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000
# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
Running Both Simultaneously
# Terminal 1: Start MCP server
./run_mcp_server.sh
# Terminal 2: Start REST API server
./run_api_server.sh
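Once both are running, a quick way to confirm the REST API is up is to hit the /health endpoint (a minimal sketch using requests; adjust host/port if you changed API_HOST or API_PORT):
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.json())  # expected: {"status": "healthy", "service": "whisper-transcription"}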
Docker
# Build Docker image
docker build -t whisper-mcp-server .
# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
Architecture
Core Components
- whisper_server.py - MCP server entry point
  - Uses FastMCP framework to expose three MCP tools
  - Delegates to transcriber.py for actual processing
  - Server initialization at line 19
- api_server.py - REST API server entry point
  - Uses FastAPI framework to expose HTTP endpoints
  - Provides six REST endpoints: /, /health, /models, /transcribe, /batch-transcribe, /upload-transcribe
  - Shares the same core transcription logic with the MCP server
  - Includes file upload support via multipart/form-data
- transcriber.py - Core transcription logic (shared by both servers)
  - transcribe_audio() (line 38) - Single-file transcription with environment variable support
  - batch_transcribe() (line 208) - Batch processing with progress reporting
  - All parameters support environment variable defaults
  - Delegates output formatting to formatters.py
- model_manager.py - Whisper model lifecycle management
  - get_whisper_model() (line 44) - Returns cached model instances or loads new ones
  - test_gpu_driver() (line 20) - GPU validation before model loading
  - Global model_instances dict caches loaded models to prevent reloading
  - Automatically determines batch size based on available GPU memory (lines 113-134)
- audio_processor.py - Audio file validation and preprocessing
  - validate_audio_file() (line 15) - Checks file existence, format, and size
  - process_audio() (line 50) - Decodes audio using faster_whisper's decode_audio
- formatters.py - Output format conversion
  - format_vtt(), format_srt(), format_txt(), format_json() - Convert segments to various formats (see the sketch after this list)
  - All formatters accept segment lists from Whisper output
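To illustrate the formatter idea, here is a minimal sketch of an SRT-style converter. It is hypothetical, not the actual formatters.py code; segments are assumed to expose .start, .end, and .text, as faster-whisper segments do.
# Hypothetical sketch of a segment-to-SRT formatter
def format_srt(segments) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int((seconds - int(seconds)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)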
Key Architecture Patterns
- Dual Server Architecture: Both MCP and REST API servers import and use the same core modules (transcriber.py, model_manager.py, audio_processor.py, formatters.py), ensuring consistent behavior
- Model Caching: Models are cached in the model_instances dictionary with key format {model_name}_{device}_{compute_type} (model_manager.py:84). The cache is shared only when both servers run in the same process (see the sketch after this list)
- Batch Processing: CUDA devices automatically use BatchedInferencePipeline for performance (model_manager.py:109-134)
- Environment Variable Configuration: All transcription parameters support env var defaults (transcriber.py:19-36)
- Device Auto-Detection: device="auto" automatically selects CUDA if available, otherwise CPU (model_manager.py:64-66)
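The caching pattern amounts to a keyed get-or-load. The sketch below shows the shape of it, assuming faster_whisper's WhisperModel; the real model_manager.py additionally runs the GPU driver test and batch-size logic described elsewhere in this file.
# Minimal sketch of the model caching pattern (not the actual model_manager.py code)
from faster_whisper import WhisperModel

model_instances = {}  # key -> loaded model, shared within one process

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"  # key format from model_manager.py:84
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]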
Environment Variables
All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:
API Server Specific:
- API_HOST - API server host (default: 0.0.0.0)
- API_PORT - API server port (default: 8000)
Transcription Configuration (shared by both servers):
- CUDA_VISIBLE_DEVICES - GPU device selection
- WHISPER_MODEL_DIR - Model storage location (defaults to None for HuggingFace cache)
- TRANSCRIPTION_OUTPUT_DIR - Default output directory for single transcriptions
- TRANSCRIPTION_BATCH_OUTPUT_DIR - Default output directory for batch processing
- TRANSCRIPTION_MODEL - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- TRANSCRIPTION_DEVICE - Execution device (cpu, cuda, auto)
- TRANSCRIPTION_COMPUTE_TYPE - Computation type (float16, int8, auto)
- TRANSCRIPTION_OUTPUT_FORMAT - Output format (vtt, srt, txt, json)
- TRANSCRIPTION_BEAM_SIZE - Beam search size (default: 5)
- TRANSCRIPTION_TEMPERATURE - Sampling temperature (default: 0.0)
- TRANSCRIPTION_USE_TIMESTAMP - Add timestamp to filenames (true/false)
- TRANSCRIPTION_FILENAME_PREFIX - Prefix for output filenames
- TRANSCRIPTION_FILENAME_SUFFIX - Suffix for output filenames
- TRANSCRIPTION_LANGUAGE - Language code (zh, en, ja, etc.; auto-detect if not set)
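The transcription defaults in transcriber.py are read from these variables. A rough sketch of the pattern (the variable names are the real ones above; the structure and fallback values shown here are illustrative, based on the defaults documented in this file):
# Illustrative sketch of env-var-backed defaults, not the actual transcriber.py code
import os

DEFAULTS = {
    "model_name": os.getenv("TRANSCRIPTION_MODEL", "large-v3"),
    "device": os.getenv("TRANSCRIPTION_DEVICE", "auto"),
    "compute_type": os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto"),
    "output_format": os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt"),
    "beam_size": int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5")),
    "temperature": float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0")),
    "language": os.getenv("TRANSCRIPTION_LANGUAGE") or None,  # empty/unset -> auto-detect
}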
Supported Configurations
- Models: tiny, base, small, medium, large-v1, large-v2, large-v3
- Audio formats: .mp3, .wav, .m4a, .flac, .ogg, .aac
- Output formats: vtt, srt, json, txt
- Languages: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)
REST API Endpoints
The REST API server provides the following HTTP endpoints:
GET /
Returns API information and available endpoints.
GET /health
Health check endpoint. Returns {"status": "healthy", "service": "whisper-transcription"}.
GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).
POST /transcribe
Transcribe a single audio file that exists on the server.
Request Body:
{
"audio_path": "/path/to/audio.mp3",
"model_name": "large-v3",
"device": "auto",
"compute_type": "auto",
"language": "en",
"output_format": "txt",
"beam_size": 5,
"temperature": 0.0,
"initial_prompt": null,
"output_directory": null
}
Response:
{
"success": true,
"message": "Transcription successful, results saved to: /path/to/output.txt",
"output_path": "/path/to/output.txt"
}
POST /batch-transcribe
Batch transcribe all audio files in a folder.
Request Body:
{
"audio_folder": "/path/to/audio/folder",
"output_folder": "/path/to/output",
"model_name": "large-v3",
"output_format": "txt",
...
}
Response:
{
"success": true,
"summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.
Form Data:
- file: Audio file (multipart/form-data)
- model_name: Model name (default: "large-v3")
- device: Device (default: "auto")
- output_format: Output format (default: "txt")
- ... (other transcription parameters)
Response: Returns the transcription file for download.
API Usage Examples
# Get model information
curl http://localhost:8000/models
# Transcribe existing file
curl -X POST http://localhost:8000/transcribe \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'
# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
-F "file=@audio.mp3" \
-F "output_format=txt" \
-F "model_name=large-v3"
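The same calls can be made from Python with the requests library (a minimal sketch; file paths are placeholders):
import requests

BASE = "http://localhost:8000"

# Transcribe a file that already exists on the server
r = requests.post(f"{BASE}/transcribe",
                  json={"audio_path": "/path/to/audio.mp3", "output_format": "txt"})
print(r.json())

# Upload a local file and save the returned transcription
with open("audio.mp3", "rb") as f:
    r = requests.post(f"{BASE}/upload-transcribe",
                      files={"file": f},
                      data={"output_format": "txt", "model_name": "large-v3"})
with open("audio.txt", "wb") as out:
    out.write(r.content)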
Important Implementation Details
- GPU memory is checked before loading models (model_manager.py:115-127)
- Batch size adjusts dynamically with available GPU memory: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 otherwise; see the sketch after this list
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (transcriber.py:101)
- Word timestamps are enabled by default (transcriber.py:106)
- Model loading includes GPU driver test to fail fast if GPU is unavailable (model_manager.py:92)
- Files over 1GB generate warnings about processing time (audio_processor.py:42)
- Default output format is "txt" for REST API, configured via environment variables for MCP server
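The sketch below ties these details together: the documented batch-size thresholds and a transcription call with the documented defaults (VAD and word timestamps enabled). It uses the real faster_whisper API, but it is a simplified illustration, not the actual model_manager.py/transcriber.py code.
# Illustrative sketch, assuming faster_whisper's WhisperModel / BatchedInferencePipeline
from faster_whisper import WhisperModel, BatchedInferencePipeline

def pick_batch_size(gpu_mem_gb: float) -> int:
    # Thresholds documented above (model_manager.py:113-134)
    if gpu_mem_gb > 16:
        return 32
    if gpu_mem_gb > 12:
        return 16
    if gpu_mem_gb > 8:
        return 8
    if gpu_mem_gb > 4:
        return 4
    return 2

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)  # used automatically on CUDA devices

segments, info = pipeline.transcribe(
    "audio.mp3",
    batch_size=pick_batch_size(16.0),  # pass the measured GPU memory here
    beam_size=5,                       # TRANSCRIPTION_BEAM_SIZE default
    temperature=0.0,                   # TRANSCRIPTION_TEMPERATURE default
    vad_filter=True,                   # VAD on by default (transcriber.py:101)
    word_timestamps=True,              # word timestamps on by default (transcriber.py:106)
)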