# Whisper Speech Recognition MCP Server

---

[中文文档](README-CN.md)

---

A high-performance speech recognition MCP server based on Faster Whisper, providing efficient audio transcription capabilities.

## Features

- Integrated with Faster Whisper for efficient speech recognition
- Batch processing acceleration for improved transcription speed
- Automatic CUDA acceleration (if available)
- Support for multiple model sizes (tiny to large-v3)
- Output formats include VTT subtitles, SRT, and JSON
- Support for batch transcription of audio files in a folder
- Model instance caching to avoid repeated loading
- Dynamic batch size adjustment based on GPU memory

## Installation

### Dependencies

- Python 3.10+
- faster-whisper>=0.9.0
- torch==2.6.0+cu126
- torchaudio==2.6.0+cu126
- mcp[cli]>=1.2.0

### Installation Steps

1. Clone or download this repository
2. Create and activate a virtual environment (recommended)
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

### PyTorch Installation Guide

Install the appropriate version of PyTorch based on your CUDA version:

- CUDA 12.6:

  ```bash
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
  ```

- CUDA 12.1:

  ```bash
  pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
  ```

- CPU version:

  ```bash
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
  ```

You can check your CUDA version with `nvcc --version` or `nvidia-smi`.

## Usage

### Starting the Server

On Windows, simply run `start_server.bat`. On other platforms, run:

```bash
python whisper_server.py
```

### Configuring Claude Desktop

1. Open the Claude Desktop configuration file:
   - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
   - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
2. Add the Whisper server configuration:

   ```json
   {
     "mcpServers": {
       "whisper": {
         "command": "python",
         "args": ["D:/path/to/whisper_server.py"],
         "env": {}
       }
     }
   }
   ```

3. Restart Claude Desktop

### Available Tools

The server provides the following tools:

1. **get_model_info** - Get information about available Whisper models
2. **transcribe** - Transcribe a single audio file
3. **batch_transcribe** - Batch transcribe audio files in a folder

## Performance Optimization Tips

- Using CUDA acceleration significantly improves transcription speed
- Batch processing mode is more efficient for large numbers of short audio files
- Batch size is automatically adjusted based on GPU memory size
- Using VAD (Voice Activity Detection) filtering improves accuracy for long audio
- Specifying the correct language can improve transcription quality

The sketches below illustrate how some of these options translate into code.
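For example, applying the CUDA, VAD, and language tips above to a plain faster-whisper call might look like the following. This is a minimal sketch, not the project's actual `transcriber.py`; the file path `audio.mp3` and the parameter values are placeholders.

```python
from faster_whisper import WhisperModel

# Load the model on the GPU with half-precision weights.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# VAD filtering skips long silences; an explicit language avoids misdetection on short clips.
segments, info = model.transcribe(
    "audio.mp3",      # placeholder path
    language="en",    # set to the spoken language, or omit to auto-detect
    vad_filter=True,
    beam_size=5,
)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

If CUDA is unavailable, `device="cpu"` with `compute_type="int8"` is a common fallback.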
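The automatic batch-size adjustment could, for instance, be driven by total GPU memory. The helper below is an illustrative heuristic only; the name `pick_batch_size` and its thresholds are assumptions, not the actual logic in `model_manager.py`.

```python
import torch


def pick_batch_size(default: int = 8) -> int:
    """Illustrative heuristic: scale the transcription batch size with total GPU memory."""
    if not torch.cuda.is_available():
        return 1  # no CUDA: process audio sequentially
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 24:
        return 32
    if total_gb >= 12:
        return 16
    return default
```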
## Local Testing Methods

1. Use MCP Inspector for quick testing:

   ```bash
   mcp dev whisper_server.py
   ```

2. Use Claude Desktop for integration testing

3. Use direct command-line invocation (requires mcp[cli]):

   ```bash
   mcp run whisper_server.py
   ```

## Error Handling

The server implements the following error handling mechanisms:

- Audio file existence check
- Model loading failure handling
- Exception catching during transcription
- GPU memory management
- Adaptive adjustment of batch processing parameters

## Project Structure

- `whisper_server.py`: Main server code
- `model_manager.py`: Whisper model loading and caching
- `audio_processor.py`: Audio file validation and preprocessing
- `formatters.py`: Output formatting (VTT, SRT, JSON)
- `transcriber.py`: Core transcription logic
- `start_server.bat`: Windows startup script

## License

MIT

## Acknowledgements

This project was developed with the assistance of these amazing AI tools and models:

- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer
- [Trae](https://trae.ai/) - Agentic AI coding assistant
- [Cline](https://cline.ai/) - AI coding agent
- [DeepSeek](https://www.deepseek.com/) - Advanced AI model
- [Claude-3.7-Sonnet](https://www.anthropic.com/claude) - Anthropic's powerful AI assistant
- [Gemini-2.0-Flash](https://ai.google/gemini/) - Google's multimodal AI model
- [VS Code](https://code.visualstudio.com/) - Powerful code editor
- [Whisper](https://github.com/openai/whisper) - OpenAI's speech recognition model
- [Faster Whisper](https://github.com/guillaumekln/faster-whisper) - Optimized Whisper implementation

Special thanks to these incredible tools and the teams behind them.