164 lines
4.4 KiB
Markdown
164 lines
4.4 KiB
Markdown
# Whisper Speech Recognition MCP Server
|
|
---
|
|
[中文文档](README-CN.md)
|
|
---
|
|
A high-performance speech recognition MCP server based on Faster Whisper, providing efficient audio transcription capabilities.
|
|
|
|
## Features
|
|
|
|
- Integrated with Faster Whisper for efficient speech recognition
|
|
- Batch processing acceleration for improved transcription speed
|
|
- Automatic CUDA acceleration (if available)
|
|
- Support for multiple model sizes (tiny to large-v3)
|
|
- Output formats include VTT subtitles, SRT, and JSON
|
|
- Support for batch transcription of audio files in a folder
|
|
- Model instance caching to avoid repeated loading
|
|
- Dynamic batch size adjustment based on GPU memory
|
|
|
|
## Installation
|
|
|
|
### Dependencies
|
|
|
|
- Python 3.10+
|
|
- faster-whisper>=0.9.0
|
|
- torch==2.6.0+cu126
|
|
- torchaudio==2.6.0+cu126
|
|
- mcp[cli]>=1.2.0
|
|
|
|
### Installation Steps
|
|
|
|
1. Clone or download this repository
|
|
2. Create and activate a virtual environment (recommended)
|
|
3. Install dependencies:
|
|
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
|
|
### PyTorch Installation Guide
|
|
|
|
Install the appropriate version of PyTorch based on your CUDA version:
|
|
|
|
- CUDA 12.6:
|
|
```bash
|
|
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
|
|
```
|
|
|
|
- CUDA 12.1:
|
|
```bash
|
|
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
|
|
```
|
|
|
|
- CPU version:
|
|
```bash
|
|
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
|
|
```
|
|
|
|
You can check your CUDA version with `nvcc --version` or `nvidia-smi`.
|
|
|
|
## Usage
|
|
|
|
### Starting the Server
|
|
|
|
On Windows, simply run `start_server.bat`.
|
|
|
|
On other platforms, run:
|
|
|
|
```bash
|
|
python whisper_server.py
|
|
```
|
|
|
|
### Configuring Claude Desktop
|
|
|
|
1. Open the Claude Desktop configuration file:
|
|
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
|
|
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
|
|
|
|
2. Add the Whisper server configuration:
|
|
|
|
```json
|
|
{
|
|
"mcpServers": {
|
|
"whisper": {
|
|
"command": "python",
|
|
"args": ["D:/path/to/whisper_server.py"],
|
|
"env": {}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
3. Restart Claude Desktop
|
|
|
|
### Available Tools
|
|
|
|
The server provides the following tools:
|
|
|
|
1. **get_model_info** - Get information about available Whisper models
|
|
2. **transcribe** - Transcribe a single audio file
|
|
3. **batch_transcribe** - Batch transcribe audio files in a folder
|
|
|
|
## Performance Optimization Tips
|
|
|
|
- Using CUDA acceleration significantly improves transcription speed
|
|
- Batch processing mode is more efficient for large numbers of short audio files
|
|
- Batch size is automatically adjusted based on GPU memory size
|
|
- Using VAD (Voice Activity Detection) filtering improves accuracy for long audio
|
|
- Specifying the correct language can improve transcription quality
|
|
|
|
## Local Testing Methods
|
|
|
|
1. Use MCP Inspector for quick testing:
|
|
|
|
```bash
|
|
mcp dev whisper_server.py
|
|
```
|
|
|
|
2. Use Claude Desktop for integration testing
|
|
|
|
3. Use command line direct invocation (requires mcp[cli]):
|
|
|
|
```bash
|
|
mcp run whisper_server.py
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
The server implements the following error handling mechanisms:
|
|
|
|
- Audio file existence check
|
|
- Model loading failure handling
|
|
- Transcription process exception catching
|
|
- GPU memory management
|
|
- Batch processing parameter adaptive adjustment
|
|
|
|
## Project Structure
|
|
|
|
- `whisper_server.py`: Main server code
|
|
- `model_manager.py`: Whisper model loading and caching
|
|
- `audio_processor.py`: Audio file validation and preprocessing
|
|
- `formatters.py`: Output formatting (VTT, SRT, JSON)
|
|
- `transcriber.py`: Core transcription logic
|
|
- `start_server.bat`: Windows startup script
|
|
|
|
## License
|
|
|
|
MIT
|
|
|
|
## Acknowledgements
|
|
|
|
This project was developed with the assistance of these amazing AI tools and models:
|
|
|
|
- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer
|
|
- [Trae](https://trae.ai/) - Agentic AI coding assistant
|
|
- [Cline](https://cline.ai/) - AI-powered terminal
|
|
- [DeepSeek](https://www.deepseek.com/) - Advanced AI model
|
|
- [Claude-3.7-Sonnet](https://www.anthropic.com/claude) - Anthropic's powerful AI assistant
|
|
- [Gemini-2.0-Flash](https://ai.google/gemini/) - Google's multimodal AI model
|
|
- [VS Code](https://code.visualstudio.com/) - Powerful code editor
|
|
- [Whisper](https://github.com/openai/whisper) - OpenAI's speech recognition model
|
|
- [Faster Whisper](https://github.com/guillaumekln/faster-whisper) - Optimized Whisper implementation
|
|
|
|
Special thanks to these incredible tools and the teams behind them.
|
|
|