# Whisper Speech Recognition MCP Server

---

[中文文档](README-CN.md)

---

A high-performance speech recognition MCP server based on Faster Whisper, providing efficient audio transcription capabilities.

## Features

- Integrated with Faster Whisper for efficient speech recognition
- Batch processing acceleration for improved transcription speed
- Automatic CUDA acceleration (if available)
- Support for multiple model sizes (tiny to large-v3)
- Output formats include VTT subtitles, SRT, and JSON
- Support for batch transcription of audio files in a folder
- Model instance caching to avoid repeated loading
- Dynamic batch size adjustment based on GPU memory

## Installation

### Dependencies

- Python 3.10+
- faster-whisper>=0.9.0
- torch==2.6.0+cu126
- torchaudio==2.6.0+cu126
- mcp[cli]>=1.2.0

### Installation Steps

1. Clone or download this repository
2. Create and activate a virtual environment (recommended)
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

### PyTorch Installation Guide

Install the appropriate version of PyTorch based on your CUDA version:

- CUDA 12.6:

  ```bash
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
  ```

- CUDA 12.1:

  ```bash
  pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
  ```

- CPU version:

  ```bash
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
  ```

You can check your CUDA version with `nvcc --version` or `nvidia-smi`.

## Usage

### Starting the Server

On Windows, simply run `start_server.bat`. On other platforms, run:

```bash
python whisper_server.py
```

### Configuring Claude Desktop

1. Open the Claude Desktop configuration file:
   - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
   - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
2. Add the Whisper server configuration:

   ```json
   {
     "mcpServers": {
       "whisper": {
         "command": "python",
         "args": ["D:/path/to/whisper_server.py"],
         "env": {}
       }
     }
   }
   ```

3. Restart Claude Desktop

### Available Tools

The server provides the following tools:

1. **get_model_info** - Get information about available Whisper models
2. **transcribe** - Transcribe a single audio file
3. **batch_transcribe** - Batch transcribe audio files in a folder

## Performance Optimization Tips

- Using CUDA acceleration significantly improves transcription speed
- Batch processing mode is more efficient for large numbers of short audio files
- Batch size is automatically adjusted based on GPU memory size
- Using VAD (Voice Activity Detection) filtering improves accuracy for long audio
- Specifying the correct language can improve transcription quality

The sketches below illustrate how some of these options translate into code.
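For example, applying the CUDA, VAD, and language tips above to a plain faster-whisper call might look like the following. This is a minimal sketch, not the project's actual `transcriber.py`; the file path `audio.mp3` and the parameter values are placeholders.

```python
from faster_whisper import WhisperModel

# Load the model on the GPU with half-precision weights.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# VAD filtering skips long silences; an explicit language avoids misdetection on short clips.
segments, info = model.transcribe(
    "audio.mp3",      # placeholder path
    language="en",    # set to the spoken language, or omit to auto-detect
    vad_filter=True,
    beam_size=5,
)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

If CUDA is unavailable, `device="cpu"` with `compute_type="int8"` is a common fallback.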
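The automatic batch-size adjustment could, for instance, be driven by total GPU memory. The helper below is an illustrative heuristic only; the name `pick_batch_size` and its thresholds are assumptions, not the actual logic in `model_manager.py`.

```python
import torch


def pick_batch_size(default: int = 8) -> int:
    """Illustrative heuristic: scale the transcription batch size with total GPU memory."""
    if not torch.cuda.is_available():
        return 1  # no CUDA: process audio sequentially
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 24:
        return 32
    if total_gb >= 12:
        return 16
    return default
```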
## Local Testing Methods

1. Use MCP Inspector for quick testing:

   ```bash
   mcp dev whisper_server.py
   ```

2. Use Claude Desktop for integration testing

3. Use direct command-line invocation (requires mcp[cli]):

   ```bash
   mcp run whisper_server.py
   ```

## Error Handling

The server implements the following error handling mechanisms:

- Audio file existence check
- Model loading failure handling
- Exception catching during transcription
- GPU memory management
- Adaptive adjustment of batch processing parameters

## Project Structure

- `whisper_server.py`: Main server code
- `model_manager.py`: Whisper model loading and caching
- `audio_processor.py`: Audio file validation and preprocessing
- `formatters.py`: Output formatting (VTT, SRT, JSON)
- `transcriber.py`: Core transcription logic
- `start_server.bat`: Windows startup script

## License

MIT

## Acknowledgements

This project was developed with the assistance of these amazing AI tools and models:

- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer
- [Trae](https://trae.ai/) - Agentic AI coding assistant
- [Cline](https://cline.ai/) - AI coding agent
- [DeepSeek](https://www.deepseek.com/) - Advanced AI model
- [Claude-3.7-Sonnet](https://www.anthropic.com/claude) - Anthropic's powerful AI assistant
- [Gemini-2.0-Flash](https://ai.google/gemini/) - Google's multimodal AI model
- [VS Code](https://code.visualstudio.com/) - Powerful code editor
- [Whisper](https://github.com/openai/whisper) - OpenAI's speech recognition model
- [Faster Whisper](https://github.com/guillaumekln/faster-whisper) - Optimized Whisper implementation

Special thanks to these incredible tools and the teams behind them.