Compare commits

2 Commits: 56ccc0e1d7 ... 7c9a8d8378

| Author | SHA1 | Date |
|---|---|---|
| | 7c9a8d8378 | |
| | 2cc9f298a5 | |

CLAUDE.md (new file, 265 lines)

@@ -0,0 +1,265 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service can run as either:

1. **MCP Server** - For integration with Claude Desktop and other MCP clients
2. **REST API Server** - For HTTP-based integrations

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

## Development Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```

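After installing one of the PyTorch builds above, a quick check like the following can confirm that the GPU build is actually usable before starting either server (a minimal sketch, not a file in this repository; it relies only on the standard `torch.cuda` API):

```python
# Hypothetical sanity check, not part of this repository.
import torch

# True only when a CUDA-enabled PyTorch build and a visible GPU driver are present
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name and memory of the first visible device (respects CUDA_VISIBLE_DEVICES)
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```
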
### Running the Servers

#### MCP Server (for Claude Desktop)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution
python whisper_server.py

# Using MCP CLI for development testing
mcp dev whisper_server.py

# Run server with MCP CLI
mcp run whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn
python api_server.py

# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```

## Architecture

### Core Components

1. **whisper_server.py** - MCP server entry point
   - Uses FastMCP framework to expose three MCP tools
   - Delegates to transcriber.py for actual processing
   - Server initialization at line 19

2. **api_server.py** - REST API server entry point
   - Uses FastAPI framework to expose HTTP endpoints
   - Provides six REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
   - Shares the same core transcription logic with the MCP server
   - Includes file upload support via multipart/form-data

3. **transcriber.py** - Core transcription logic (shared by both servers)
   - `transcribe_audio()` (line 38) - Single file transcription with environment variable support
   - `batch_transcribe()` (line 208) - Batch processing with progress reporting
   - All parameters support environment variable defaults
   - Delegates output formatting to formatters.py

4. **model_manager.py** - Whisper model lifecycle management
   - `get_whisper_model()` (line 44) - Returns cached model instances or loads new ones
   - `test_gpu_driver()` (line 20) - GPU validation before model loading
   - Global `model_instances` dict caches loaded models to prevent reloading
   - Automatically determines batch size based on available GPU memory (lines 113-134)

5. **audio_processor.py** - Audio file validation and preprocessing
   - `validate_audio_file()` (line 15) - Checks file existence, format, and size
   - `process_audio()` (line 50) - Decodes audio using faster_whisper's decode_audio

6. **formatters.py** - Output format conversion
   - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - Convert segments to various formats
   - All formatters accept segment lists from Whisper output (see the sketch after this list)

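As a rough illustration of the formatter contract in item 6: a minimal sketch of an SRT-style formatter, assuming each segment exposes `start`, `end`, and `text` attributes as faster-whisper segments do. The actual `format_srt()` in formatters.py may differ.

```python
# Illustrative sketch only; the real format_srt() in formatters.py may differ.
def format_srt(segments) -> str:
    def stamp(seconds: float) -> str:
        # SRT timestamps look like 00:01:02,345
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg.start)} --> {stamp(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)
```
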
### Key Architecture Patterns

- **Dual Server Architecture**: Both MCP and REST API servers import and use the same core modules (transcriber.py, model_manager.py, audio_processor.py, formatters.py), ensuring consistent behavior
- **Model Caching**: Models are cached in the `model_instances` dictionary with key format `{model_name}_{device}_{compute_type}` (model_manager.py:84). This cache is shared if both servers run in the same process (see the sketch after this list)
- **Batch Processing**: CUDA devices automatically use BatchedInferencePipeline for performance (model_manager.py:109-134)
- **Environment Variable Configuration**: All transcription parameters support env var defaults (transcriber.py:19-36)
- **Device Auto-Detection**: `device="auto"` automatically selects CUDA if available, otherwise CPU (model_manager.py:64-66)

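A condensed sketch of how the model-caching and device auto-detection patterns above fit together. This is illustrative only: apart from the cited cache-key format and the `model_instances` name, the details (such as the compute-type fallback) are assumptions rather than the actual model_manager.py code.

```python
# Illustrative sketch; the real get_whisper_model() in model_manager.py differs.
import torch
from faster_whisper import WhisperModel

model_instances = {}  # cache shared for the lifetime of the process

def get_whisper_model(model_name: str, device: str, compute_type: str):
    # Device auto-detection: prefer CUDA when a GPU is visible
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    if compute_type == "auto":
        compute_type = "float16" if device == "cuda" else "int8"

    # Cache key mirrors the documented {model_name}_{device}_{compute_type} format
    key = f"{model_name}_{device}_{compute_type}"
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]
```
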
## Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh (see the sketch after these lists):

**API Server Specific:**

- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

**Transcription Configuration (shared by both servers):**

- `CUDA_VISIBLE_DEVICES` - GPU device selection
- `WHISPER_MODEL_DIR` - Model storage location (defaults to None for HuggingFace cache)
- `TRANSCRIPTION_OUTPUT_DIR` - Default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - Default output directory for batch processing
- `TRANSCRIPTION_MODEL` - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- `TRANSCRIPTION_DEVICE` - Execution device (cpu, cuda, auto)
- `TRANSCRIPTION_COMPUTE_TYPE` - Computation type (float16, int8, auto)
- `TRANSCRIPTION_OUTPUT_FORMAT` - Output format (vtt, srt, txt, json)
- `TRANSCRIPTION_BEAM_SIZE` - Beam search size (default: 5)
- `TRANSCRIPTION_TEMPERATURE` - Sampling temperature (default: 0.0)
- `TRANSCRIPTION_USE_TIMESTAMP` - Add timestamp to filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - Prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - Suffix for output filenames
- `TRANSCRIPTION_LANGUAGE` - Language code (zh, en, ja, etc.; auto-detect if not set)

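The environment-variable-default pattern referenced above (transcriber.py:19-36) roughly amounts to module-level lookups like the following. This is a sketch; the exact default values and parsing in transcriber.py may differ.

```python
# Illustrative sketch of env-var defaults; transcriber.py:19-36 is the authoritative source.
import os

DEFAULT_MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEFAULT_DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")
DEFAULT_OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
DEFAULT_BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
DEFAULT_TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
# Boolean flags arrive as strings, so they need explicit normalization
USE_TIMESTAMP = os.getenv("TRANSCRIPTION_USE_TIMESTAMP", "false").lower() == "true"
```
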
## Supported Configurations

- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /

Returns API information and available endpoints.

### GET /health

Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA is available).

### POST /transcribe

Transcribe a single audio file that already exists on the server.

**Request Body:**
```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**
```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```

### POST /batch-transcribe

Batch transcribe all audio files in a folder.

**Request Body:**
```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**
```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**

- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.

### API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"
```

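The same calls from Python, for clients that prefer not to shell out to curl. This is a client-side sketch that assumes the third-party `requests` package is installed and a server listening on localhost:8000; it uses only the endpoints and fields documented above.

```python
# Illustrative client sketch; assumes `pip install requests` and a server on localhost:8000.
import requests

BASE = "http://localhost:8000"

# Transcribe a file that already exists on the server
resp = requests.post(f"{BASE}/transcribe", json={
    "audio_path": "/path/to/audio.mp3",
    "output_format": "txt",
})
print(resp.json())  # {"success": ..., "message": ..., "output_path": ...}

# Upload a local file and save the returned transcription
with open("audio.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE}/upload-transcribe",
        files={"file": f},
        data={"output_format": "txt", "model_name": "large-v3"},
    )
resp.raise_for_status()
with open("audio.txt", "wb") as out:
    out.write(resp.content)
```
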
## Important Implementation Details

- GPU memory is checked before loading models (model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise) (see the sketch below)
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (transcriber.py:101)
- Word timestamps are enabled by default (transcriber.py:106)
- Model loading includes a GPU driver test to fail fast if the GPU is unavailable (model_manager.py:92)
- Files over 1GB generate warnings about processing time (audio_processor.py:42)
- Default output format is "txt" for the REST API and is configured via environment variables for the MCP server

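The batch-size tiering above maps onto a simple threshold ladder. The following sketch only restates the documented thresholds and is not the actual model_manager.py implementation (which inspects GPU memory around lines 113-134).

```python
# Sketch of the documented batch-size tiers; model_manager.py:113-134 is authoritative.
import torch

def pick_batch_size(device_index: int = 0) -> int:
    if not torch.cuda.is_available():
        return 2
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    if total_gb > 16:
        return 32
    if total_gb > 12:
        return 16
    if total_gb > 8:
        return 8
    if total_gb > 4:
        return 4
    return 2
```
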
api_server.py (new file, 286 lines)

@@ -0,0 +1,286 @@
#!/usr/bin/env python3
"""
FastAPI REST API Server for Whisper Transcription
Provides HTTP REST endpoints for audio transcription
"""

import os
import logging
from typing import Optional
from fastapi import FastAPI, HTTPException, UploadFile, File, Form
from fastapi.responses import JSONResponse, FileResponse
from pydantic import BaseModel, Field
import json

from model_manager import get_model_info
from transcriber import transcribe_audio, batch_transcribe

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create FastAPI app
app = FastAPI(
    title="Whisper Speech Recognition API",
    description="High-performance audio transcription API based on Faster Whisper",
    version="0.1.1"
)


# Request/Response Models
class TranscribeRequest(BaseModel):
    audio_path: str = Field(..., description="Path to the audio file on the server")
    model_name: str = Field("large-v3", description="Whisper model name")
    device: str = Field("auto", description="Execution device (cpu, cuda, auto)")
    compute_type: str = Field("auto", description="Computation type (float16, int8, auto)")
    language: Optional[str] = Field(None, description="Language code (zh, en, ja, etc.)")
    output_format: str = Field("txt", description="Output format (vtt, srt, json, txt)")
    beam_size: int = Field(5, description="Beam search size")
    temperature: float = Field(0.0, description="Sampling temperature")
    initial_prompt: Optional[str] = Field(None, description="Initial prompt text")
    output_directory: Optional[str] = Field(None, description="Output directory path")


class BatchTranscribeRequest(BaseModel):
    audio_folder: str = Field(..., description="Path to folder containing audio files")
    output_folder: Optional[str] = Field(None, description="Output folder path")
    model_name: str = Field("large-v3", description="Whisper model name")
    device: str = Field("auto", description="Execution device (cpu, cuda, auto)")
    compute_type: str = Field("auto", description="Computation type (float16, int8, auto)")
    language: Optional[str] = Field(None, description="Language code (zh, en, ja, etc.)")
    output_format: str = Field("txt", description="Output format (vtt, srt, json, txt)")
    beam_size: int = Field(5, description="Beam search size")
    temperature: float = Field(0.0, description="Sampling temperature")
    initial_prompt: Optional[str] = Field(None, description="Initial prompt text")
    parallel_files: int = Field(1, description="Number of files to process in parallel")


class TranscribeResponse(BaseModel):
    success: bool
    message: str
    output_path: Optional[str] = None


class BatchTranscribeResponse(BaseModel):
    success: bool
    summary: str


# API Endpoints

@app.get("/")
async def root():
    """Root endpoint with API information"""
    return {
        "name": "Whisper Speech Recognition API",
        "version": "0.1.1",
        "endpoints": {
            "GET /health": "Health check",
            "GET /models": "Get available models information",
            "POST /transcribe": "Transcribe a single audio file",
            "POST /batch-transcribe": "Batch transcribe audio files",
            "POST /upload-transcribe": "Upload and transcribe audio file"
        }
    }


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "service": "whisper-transcription"}


@app.get("/models")
async def get_models():
    """Get available Whisper models and configuration information"""
    try:
        model_info = get_model_info()
        return JSONResponse(content=json.loads(model_info))
    except Exception as e:
        logger.error(f"Failed to get model info: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to get model info: {str(e)}")


@app.post("/transcribe", response_model=TranscribeResponse)
async def transcribe(request: TranscribeRequest):
    """
    Transcribe a single audio file

    The audio file must already exist on the server at the specified path.
    """
    try:
        logger.info(f"Received transcription request for: {request.audio_path}")

        result = transcribe_audio(
            audio_path=request.audio_path,
            model_name=request.model_name,
            device=request.device,
            compute_type=request.compute_type,
            language=request.language,
            output_format=request.output_format,
            beam_size=request.beam_size,
            temperature=request.temperature,
            initial_prompt=request.initial_prompt,
            output_directory=request.output_directory
        )

        # Parse result to determine success
        if result.startswith("Error") or "failed" in result.lower():
            return TranscribeResponse(
                success=False,
                message=result,
                output_path=None
            )

        # Extract output path from success message
        output_path = None
        if "saved to:" in result:
            output_path = result.split("saved to:")[1].strip()

        return TranscribeResponse(
            success=True,
            message=result,
            output_path=output_path
        )

    except Exception as e:
        logger.error(f"Transcription failed: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Transcription failed: {str(e)}")


@app.post("/batch-transcribe", response_model=BatchTranscribeResponse)
async def batch_transcribe_endpoint(request: BatchTranscribeRequest):
    """
    Batch transcribe all audio files in a folder

    Processes all supported audio files in the specified folder.
    """
    try:
        logger.info(f"Received batch transcription request for: {request.audio_folder}")

        result = batch_transcribe(
            audio_folder=request.audio_folder,
            output_folder=request.output_folder,
            model_name=request.model_name,
            device=request.device,
            compute_type=request.compute_type,
            language=request.language,
            output_format=request.output_format,
            beam_size=request.beam_size,
            temperature=request.temperature,
            initial_prompt=request.initial_prompt,
            parallel_files=request.parallel_files
        )

        # Check if there were errors
        success = not result.startswith("Error")

        return BatchTranscribeResponse(
            success=success,
            summary=result
        )

    except Exception as e:
        logger.error(f"Batch transcription failed: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Batch transcription failed: {str(e)}")


@app.post("/upload-transcribe")
async def upload_and_transcribe(
    file: UploadFile = File(...),
    model_name: str = Form("large-v3"),
    device: str = Form("auto"),
    compute_type: str = Form("auto"),
    language: Optional[str] = Form(None),
    output_format: str = Form("txt"),
    beam_size: int = Form(5),
    temperature: float = Form(0.0),
    initial_prompt: Optional[str] = Form(None)
):
    """
    Upload an audio file and transcribe it

    This endpoint accepts file uploads via multipart/form-data.
    """
    import tempfile
    import shutil

    try:
        # Create temporary directory for upload
        temp_dir = tempfile.mkdtemp(prefix="whisper_upload_")

        # Save uploaded file
        file_ext = os.path.splitext(file.filename)[1]
        temp_audio_path = os.path.join(temp_dir, f"upload{file_ext}")

        with open(temp_audio_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)

        logger.info(f"Uploaded file saved to: {temp_audio_path}")

        # Transcribe the uploaded file
        result = transcribe_audio(
            audio_path=temp_audio_path,
            model_name=model_name,
            device=device,
            compute_type=compute_type,
            language=language,
            output_format=output_format,
            beam_size=beam_size,
            temperature=temperature,
            initial_prompt=initial_prompt,
            output_directory=temp_dir
        )

        # Parse result
        if result.startswith("Error") or "failed" in result.lower():
            # Clean up temp files
            shutil.rmtree(temp_dir, ignore_errors=True)
            raise HTTPException(status_code=500, detail=result)

        # Extract output path
        output_path = None
        if "saved to:" in result:
            output_path = result.split("saved to:")[1].strip()

        # Return the transcription file
        if output_path and os.path.exists(output_path):
            return FileResponse(
                output_path,
                media_type="text/plain",
                filename=os.path.basename(output_path),
                background=None  # Don't delete yet, we'll clean up after
            )
        else:
            # Clean up temp files
            shutil.rmtree(temp_dir, ignore_errors=True)
            return JSONResponse(content={
                "success": True,
                "message": result
            })

    except Exception as e:
        logger.error(f"Upload and transcribe failed: {str(e)}")
        # Clean up temp files on error
        if 'temp_dir' in locals():
            shutil.rmtree(temp_dir, ignore_errors=True)
        raise HTTPException(status_code=500, detail=f"Upload and transcribe failed: {str(e)}")
    finally:
        await file.close()


if __name__ == "__main__":
    import uvicorn

    # Get configuration from environment variables
    host = os.getenv("API_HOST", "0.0.0.0")
    port = int(os.getenv("API_PORT", "8000"))

    logger.info(f"Starting Whisper REST API server on {host}:{port}")

    uvicorn.run(
        app,
        host=host,
        port=port,
        log_level="info"
    )

mcp.logs (6 lines removed)

@@ -1,6 +0,0 @@
{"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2025-03-26","capabilities":{"experimental":{},"prompts":{"listChanged":false},"resources":{"subscribe":false,"listChanged":false},"tools":{"listChanged":false}},"serverInfo":{"name":"fast-whisper-mcp-server","version":"1.9.4"}}}
INFO:mcp.server.lowlevel.server:Processing request of type ListToolsRequest
{"jsonrpc":"2.0","id":2,"result":{"tools":[{"name":"get_model_info_api","description":"\n Get available Whisper model information\n ","inputSchema":{"properties":{},"title":"get_model_info_apiArguments","type":"object"}},{"name":"transcribe","description":"\n Transcribe audio files using Faster Whisper\n\n Args:\n audio_path: Path to the audio file\n model_name: Model name (tiny, base, small, medium, large-v1, large-v2, large-v3)\n device: Execution device (cpu, cuda, auto)\n compute_type: Computation type (float16, int8, auto)\n language: Language code (such as zh, en, ja, etc., auto-detect by default)\n output_format: Output format (vtt, srt, json or txt)\n beam_size: Beam search size, larger values may improve accuracy but reduce speed\n temperature: Sampling temperature, greedy decoding\n initial_prompt: Initial prompt text, can help the model better understand context\n output_directory: Output directory path, defaults to the audio file's directory\n\n Returns:\n str: Transcription result, in VTT subtitle or JSON format\n ","inputSchema":{"properties":{"audio_path":{"title":"Audio Path","type":"string"},"model_name":{"default":"large-v3","title":"Model Name","type":"string"},"device":{"default":"auto","title":"Device","type":"string"},"compute_type":{"default":"auto","title":"Compute Type","type":"string"},"language":{"default":null,"title":"Language","type":"string"},"output_format":{"default":"vtt","title":"Output Format","type":"string"},"beam_size":{"default":5,"title":"Beam Size","type":"integer"},"temperature":{"default":0.0,"title":"Temperature","type":"number"},"initial_prompt":{"default":null,"title":"Initial Prompt","type":"string"},"output_directory":{"default":null,"title":"Output Directory","type":"string"}},"required":["audio_path"],"title":"transcribeArguments","type":"object"}},{"name":"batch_transcribe_audio","description":"\n Batch transcribe audio files in a folder\n\n Args:\n audio_folder: Path to the folder containing audio files\n output_folder: Output folder path, defaults to a 'transcript' subfolder in audio_folder\n model_name: Model name (tiny, base, small, medium, large-v1, large-v2, large-v3)\n device: Execution device (cpu, cuda, auto)\n compute_type: Computation type (float16, int8, auto)\n language: Language code (such as zh, en, ja, etc., auto-detect by default)\n output_format: Output format (vtt, srt, json or txt)\n beam_size: Beam search size, larger values may improve accuracy but reduce speed\n temperature: Sampling temperature, 0 means greedy decoding\n initial_prompt: Initial prompt text, can help the model better understand context\n parallel_files: Number of files to process in parallel (only effective in CPU mode)\n\n Returns:\n str: Batch processing summary, including processing time and success rate\n ","inputSchema":{"properties":{"audio_folder":{"title":"Audio Folder","type":"string"},"output_folder":{"default":null,"title":"Output Folder","type":"string"},"model_name":{"default":"large-v3","title":"Model Name","type":"string"},"device":{"default":"auto","title":"Device","type":"string"},"compute_type":{"default":"auto","title":"Compute Type","type":"string"},"language":{"default":null,"title":"Language","type":"string"},"output_format":{"default":"vtt","title":"Output Format","type":"string"},"beam_size":{"default":5,"title":"Beam Size","type":"integer"},"temperature":{"default":0.0,"title":"Temperature","type":"number"},"initial_prompt":{"default":null,"title":"Initial 
Prompt","type":"string"},"parallel_files":{"default":1,"title":"Parallel Files","type":"integer"}},"required":["audio_folder"],"title":"batch_transcribe_audioArguments","type":"object"}}]}}
INFO:mcp.server.lowlevel.server:Processing request of type CallToolRequest
INFO:model_manager:GPU test passed: NVIDIA GeForce RTX 3060 (12.5GB)
INFO:model_manager:Loading Whisper model: large-v3 device: cuda compute type: float16
requirements.txt

@@ -8,6 +8,11 @@ torchaudio #==2.6.0+cu126
# pip install mcp[cli]>=1.2.0
mcp[cli]

# REST API dependencies
fastapi>=0.115.0
uvicorn[standard]>=0.32.0
python-multipart>=0.0.9

# PyTorch Installation Guide:
# Please install the appropriate version of PyTorch based on your CUDA version:
#

run_api_server.sh (new executable file, 42 lines)

@@ -0,0 +1,42 @@
#!/bin/bash
set -e

datetime_prefix() {
    date "+[%Y-%m-%d %H:%M:%S]"
}

# Set environment variables
export CUDA_VISIBLE_DEVICES=1
export WHISPER_MODEL_DIR="/home/uad/agents/tools/mcp-transcriptor/data/models"
export TRANSCRIPTION_OUTPUT_DIR="/media/raid/agents/tools/mcp-transcriptor/outputs"
export TRANSCRIPTION_BATCH_OUTPUT_DIR="/media/raid/agents/tools/mcp-transcriptor/outputs/batch"
export TRANSCRIPTION_MODEL="large-v3"
export TRANSCRIPTION_DEVICE="cuda"
export TRANSCRIPTION_COMPUTE_TYPE="float16"
export TRANSCRIPTION_OUTPUT_FORMAT="txt"
export TRANSCRIPTION_BEAM_SIZE="5"
export TRANSCRIPTION_TEMPERATURE="0.0"
export TRANSCRIPTION_USE_TIMESTAMP="false"
export TRANSCRIPTION_FILENAME_PREFIX=""

# API server configuration
export API_HOST="0.0.0.0"
export API_PORT="8000"

# Log start of the script
echo "$(datetime_prefix) Starting Whisper REST API server..."
echo "$(datetime_prefix) Model directory: $WHISPER_MODEL_DIR"
echo "$(datetime_prefix) API server: http://$API_HOST:$API_PORT"

# Optional: Verify required directories exist
if [ ! -d "$WHISPER_MODEL_DIR" ]; then
    echo "$(datetime_prefix) Warning: Whisper model directory does not exist: $WHISPER_MODEL_DIR"
    echo "$(datetime_prefix) Models will be downloaded to default cache directory"
fi

# Ensure output directories exist
mkdir -p "$TRANSCRIPTION_OUTPUT_DIR"
mkdir -p "$TRANSCRIPTION_BATCH_OUTPUT_DIR"

# Run the API server
/home/uad/agents/tools/mcp-transcriptor/venv/bin/python -u /home/uad/agents/tools/mcp-transcriptor/api_server.py 2>&1 | tee /home/uad/agents/tools/mcp-transcriptor/api.logs

run_mcp_server.sh

@@ -34,4 +34,6 @@ if [ ! -d "$WHISPER_MODEL_DIR" ]; then
fi

# Run the Python script with the defined environment variables
/home/uad/agents/tools/mcp-transcriptor/venv/bin/python /home/uad/agents/tools/mcp-transcriptor/whisper_server.py 2>&1 | tee /home/uad/agents/tools/mcp-transcriptor/mcp.logs
#/home/uad/agents/tools/mcp-transcriptor/venv/bin/python /home/uad/agents/tools/mcp-transcriptor/whisper_server.py 2>&1 | tee /home/uad/agents/tools/mcp-transcriptor/mcp.logs
/home/uad/agents/tools/mcp-transcriptor/venv/bin/python -u /home/uad/agents/tools/mcp-transcriptor/whisper_server.py 2>&1 | tee /home/uad/agents/tools/mcp-transcriptor/mcp.logs

transcriber.py

@@ -98,7 +98,6 @@ def transcribe_audio(

    # Set transcription parameters
    options = {
        "verbose": True,
        "language": language,
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": 500},

@@ -181,12 +180,12 @@ def transcribe_audio(
    # Add suffix if specified
    if FILENAME_SUFFIX:
        filename_parts.append(FILENAME_SUFFIX)

    # Add timestamp if enabled
    if USE_TIMESTAMP:
        timestamp = time.strftime("%Y%m%d%H%M%S")
        filename_parts.append(timestamp)

    # Join parts and add extension
    base_name = "_".join(filename_parts)
    output_filename = f"{base_name}.{output_format_lower}"

@@ -358,4 +357,4 @@ def report_progress(current: int, total: int, elapsed_time: float) -> str:
    eta = (elapsed_time / current) * (total - current) if current > 0 else 0
    return (f"Progress: {current}/{total} ({progress:.1f}%)" +
            f" | Time used: {format_time(elapsed_time)}" +
            f" | Estimated remaining: {format_time(eta)}")
            f" | Estimated remaining: {format_time(eta)}")