update claude md
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**fast-whisper-mcp-server** is a high-performance audio transcription service built on faster-whisper with a dual-server architecture:

1. **MCP Server** (`whisper_server.py`) - Model Context Protocol interface for LLM integration (Claude Desktop and other MCP clients)
2. **REST API Server** (`api_server.py`) - HTTP REST endpoints built on FastAPI, with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports. The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker patterns, and comprehensive error handling. **GPU is required** - there is no CPU fallback.

**Key Features:**

- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- **Automatic GPU driver reset** on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory

## Core Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt

# Install PyTorch with the wheel index that matches your CUDA version
# CUDA 12.6:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.1:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

### Running the Servers

Both servers log to `mcp.logs` and `api.logs` respectively.

#### MCP Server (for Claude Desktop and other MCP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
python src/servers/whisper_server.py

# Using the MCP CLI for development testing
mcp dev src/servers/whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
uvicorn src.servers.api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn src.servers.api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Testing

**Important**: all tests require a working GPU and will fail if CUDA is not properly configured.

```bash
# Ensure the src/ directory is on PYTHONPATH
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"

# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py

# Run async API integration tests
python tests/test_async_api_integration.py

# Or run individual test components from the tests directory
cd tests && python test_core_components.py
```

### GPU Management

```bash
# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh

# Check GPU status
nvidia-smi

# Monitor GPU during transcription
watch -n 1 nvidia-smi
```

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```

## Architecture

### Directory Structure

```
.
├── src/                              # Source code
│   ├── servers/                      # Server implementations
│   │   ├── whisper_server.py         # MCP server (stdio transport)
│   │   └── api_server.py             # FastAPI REST server (async job queue)
│   ├── core/                         # Core business logic
│   │   ├── transcriber.py            # Transcription logic (single & batch) with env var defaults
│   │   ├── model_manager.py          # Whisper model loading/caching (GPU-only)
│   │   ├── job_queue.py              # Async FIFO job queue with worker thread
│   │   ├── gpu_health.py             # Real GPU health checks with circuit breaker
│   │   └── gpu_reset.py              # Automatic GPU driver reset with cooldown
│   └── utils/                        # Utility modules
│       ├── startup.py                # Common startup sequence (GPU check, initialization)
│       ├── circuit_breaker.py        # Circuit breaker pattern implementation
│       ├── audio_processor.py        # Audio validation & preprocessing
│       ├── formatters.py             # Output format handlers (txt, vtt, srt, json)
│       ├── input_validation.py       # Input validation utilities
│       └── test_audio_generator.py   # Generate test audio for GPU health checks
├── tests/                            # Test suite (requires GPU)
│   ├── test_core_components.py       # Core functionality tests
│   ├── test_e2e_integration.py       # End-to-end integration tests
│   └── test_async_api_integration.py # Async API tests
├── run_mcp_server.sh                 # MCP server startup script
├── run_api_server.sh                 # API server startup script
├── reset_gpu.sh                      # GPU driver reset script
├── DEV_PLAN.md                       # Development plan for async features
├── requirements.txt                  # Python dependencies
└── pyproject.toml                    # Project configuration
```

### Core Components

1. **src/servers/whisper_server.py** - MCP server entry point
   - Uses the FastMCP framework to expose MCP tools over stdio transport
   - Main tools: `get_model_info_api()`, `transcribe_async()`, `transcribe_upload()`, `check_job_status()`, `get_job_result()`
   - Holds the global job queue and health monitor instances

2. **src/servers/api_server.py** - REST API server entry point
   - Uses FastAPI for the HTTP endpoints
   - Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
   - Shares the core transcription logic with the MCP server
   - Supports file upload via multipart/form-data

3. **src/core/transcriber.py** - Core transcription logic (shared by both servers)
   - `transcribe_audio()` - single-file transcription with environment variable support
   - `batch_transcribe()` - batch processing with progress reporting
   - All parameters support environment variable defaults
   - Delegates output formatting to utils.formatters

4. **src/core/model_manager.py** - Whisper model lifecycle management
   - `get_whisper_model()` - returns cached model instances or loads new ones
   - `test_gpu_driver()` - GPU validation before model loading
   - **CRITICAL**: GPU-only mode is enforced here - there is no CPU fallback
   - The global `model_instances` dict caches loaded models to prevent reloading
   - Automatic batch size optimization based on GPU memory

5. **src/core/job_queue.py** - Async job queue manager
   - The `JobQueue` class manages a FIFO queue with a background worker thread
   - `submit_job()` - validates the audio, checks GPU health, adds the job to the queue
   - `get_job_status()` - returns the current job status and queue position
   - `get_job_result()` - returns the transcription result for completed jobs
   - Jobs persist to disk as JSON files for crash recovery
   - A single worker thread processes jobs sequentially (prevents GPU contention)

6. **src/core/gpu_health.py** - GPU health monitoring
   - `check_gpu_health()` - real GPU test using the tiny model plus generated test audio
   - The `GPUHealthStatus` dataclass contains detailed GPU metrics
   - **CRITICAL**: raises RuntimeError if device=cuda but the GPU fails
   - Prevents the silent CPU fallback that would cause a 10-100x slowdown
   - `HealthMonitor` class for periodic background monitoring

7. **src/utils/audio_processor.py** - Audio file validation and preprocessing
   - `validate_audio_file()` - checks file existence, format, and size
   - `process_audio()` - decodes audio using faster_whisper's decode_audio

8. **src/utils/formatters.py** - Output format conversion
   - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - convert segments to the various output formats
   - All formatters accept segment lists from Whisper output

9. **src/utils/test_audio_generator.py** - Test audio generation
   - `generate_test_audio()` - creates synthetic 1-second audio for GPU health checks
   - Uses numpy to generate a sine wave; no external audio files needed

10. **src/core/gpu_reset.py** - GPU driver reset with cooldown protection
    - `reset_gpu_driver()` - executes reset_gpu.sh to reload the NVIDIA kernel modules (requires sudo)
    - `check_reset_cooldown()` - validates that enough time has passed since the last reset
    - The cooldown timestamp persists in `/tmp/whisper-gpu-last-reset`
    - Prevents reset loops while still allowing recovery from sleep/wake issues

11. **src/utils/startup.py** - Startup sequence orchestration
    - `startup_sequence()` - coordinates the GPU health check and queue initialization
    - `cleanup_on_shutdown()` - cleanup handler for graceful shutdown
    - Centralizes startup logic shared by both servers

12. **src/utils/circuit_breaker.py** - Circuit breaker pattern implementation
    - Provides fault tolerance for GPU health checks and model operations
    - Prevents cascading failures

13. **src/utils/input_validation.py** - Input validation utilities
    - Validates and sanitizes user inputs
    - Security layer for the API endpoints

### Key Architectural Patterns

- **Dual-Server Architecture**: both the MCP and REST API servers import the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: models are cached in the `model_instances` dictionary with the key format `{model_name}_{device}_{compute_type}`; the cache is shared if both servers run in the same process
- **Batch Processing**: CUDA devices automatically use `BatchedInferencePipeline` for performance
- **Environment Variable Configuration**: all transcription parameters support environment variable defaults (see Environment Variables below)
- **GPU-Only Mode**: `device="auto"` requires CUDA and `device="cpu"` is rejected; if the GPU health check fails, jobs are rejected immediately rather than running 10-100x slower on CPU

**Async Job Queue** (`job_queue.py`):

- FIFO queue with a background worker thread; long-running transcriptions return a `job_id` immediately for polling, preventing HTTP timeouts
- Disk persistence of job metadata to `JOB_METADATA_DIR`
- States: QUEUED → RUNNING → COMPLETED/FAILED
- Jobs include the full request params plus results
- Thread-safe operations with locks

**GPU Health Monitoring** (`gpu_health.py`):

- Performs **real** GPU checks: loads the tiny model and transcribes test audio
- A circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
- Integrates with GPU auto-reset on failures
- Background monitoring thread in the `HealthMonitor` class
- Never falls back to CPU - raises RuntimeError if the GPU is unavailable

**GPU Auto-Reset** (`gpu_reset.py`):

- Automatically resets the GPU drivers via `reset_gpu.sh` when health checks fail
- Cooldown mechanism (default 5 minutes via `GPU_RESET_COOLDOWN_MINUTES`)
- Requires sudo - the script unloads and reloads the nvidia kernel modules
- Integrated with the circuit breaker to avoid reset loops

**Startup Sequence** (`startup.py`):

- Common startup logic for both servers
- Phase 1: GPU health check with optional auto-reset
- Phase 2: initialize the job queue
- Phase 3: initialize the health monitor (background thread)
- Exits on GPU failure unless configured otherwise

**Circuit Breaker** (`circuit_breaker.py`):

- States: CLOSED → OPEN → HALF_OPEN → CLOSED
- Configurable failure/success thresholds
- Prevents cascading failures and resource exhaustion
- Used for GPU health checks and model operations
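
To make the circuit breaker behavior concrete, here is a minimal sketch of the CLOSED → OPEN → HALF_OPEN state machine described above. It is illustrative only - the class name, thresholds, and method names are assumptions, not the actual code in `src/utils/circuit_breaker.py`.

```python
import time

class SimpleCircuitBreaker:
    """Illustrative circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a HALF_OPEN trial
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("Circuit open - refusing call")
            self.state = "HALF_OPEN"  # allow a single trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```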

### API Workflow (Async Jobs)

Both the MCP and REST API servers use the same async workflow:

1. **Submit job**: `transcribe_async()` returns a `job_id` immediately
2. **Poll status**: `get_job_status(job_id)` returns the status plus `queue_position`
3. **Get result**: when status is "completed", `get_job_result(job_id)` returns the transcription

The job queue processes one job at a time in a background worker thread.
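
For an HTTP client, the submit/poll/result loop looks roughly like this. This is a sketch assuming the async REST endpoints shown later under API Usage Examples (`POST /jobs`, `GET /jobs/{job_id}`, `GET /jobs/{job_id}/result`); the exact response fields may differ.

```python
import time
import requests

BASE = "http://localhost:8000"  # REST API server

def transcribe_async_http(audio_path: str, poll_seconds: float = 5.0) -> str:
    """Submit a transcription job and poll until it finishes (illustrative sketch)."""
    job = requests.post(f"{BASE}/jobs", json={"audio_path": audio_path}).json()
    job_id = job["job_id"]

    while True:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()
        if status["status"] in ("completed", "failed"):
            break
        time.sleep(poll_seconds)  # 5-10 second polling is recommended

    if status["status"] == "failed":
        raise RuntimeError(f"Job {job_id} failed: {status}")
    return requests.get(f"{BASE}/jobs/{job_id}/result").text
```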

### Model Loading Strategy

- Models are cached in the `model_instances` dict (key: model name + device + compute type)
- The first load downloads the model to `WHISPER_MODEL_DIR` (or the default HuggingFace cache)
- A GPU health check runs on model load and may trigger an auto-reset if the GPU fails
- There is no CPU fallback - a `RuntimeError` is raised if CUDA is unavailable
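
A minimal sketch of that caching pattern - one `WhisperModel` per (model, device, compute_type) key, using the key format documented above. The helper name is hypothetical; the real logic lives in `src/core/model_manager.py` and additionally enforces the GPU-only checks.

```python
from faster_whisper import WhisperModel

model_instances: dict = {}  # global cache, shared across requests

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"  # documented cache-key format
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]
```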

## Environment Variables

All configuration is set via environment variables; both startup scripts (`run_mcp_server.sh`, `run_api_server.sh`) export extensive defaults. Key variables:

**GPU/CUDA:**

- `CUDA_VISIBLE_DEVICES` - GPU index to use (default: 1)
- `LD_LIBRARY_PATH` - CUDA library path
- `TRANSCRIPTION_DEVICE` - execution device: "cuda" or "auto" - **"cpu" is rejected in GPU-only mode**
- `TRANSCRIPTION_COMPUTE_TYPE` - computation type: "float16", "int8", or "auto"

**Paths:**

- `WHISPER_MODEL_DIR` - model storage location (defaults to the HuggingFace cache when unset)
- `TRANSCRIPTION_OUTPUT_DIR` - default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - default output directory for batch processing
- `JOB_METADATA_DIR` - directory for job metadata JSON files

**Transcription Defaults:**

- `TRANSCRIPTION_MODEL` - model size: tiny, base, small, medium, large-v1, large-v2, large-v3 (default: "large-v3")
- `TRANSCRIPTION_OUTPUT_FORMAT` - output format: "txt", "vtt", "srt", or "json"
- `TRANSCRIPTION_BEAM_SIZE` - beam search size (default: 5 for the API server, 2 for the MCP server)
- `TRANSCRIPTION_TEMPERATURE` - sampling temperature (default: 0.0)
- `TRANSCRIPTION_LANGUAGE` - language code (zh, en, ja, etc.; auto-detect if not set)
- `TRANSCRIPTION_USE_TIMESTAMP` - add a timestamp to output filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - suffix for output filenames

**Job Queue:**

- `JOB_QUEUE_MAX_SIZE` - maximum queued jobs (default: 100 for MCP, 5 for the API server)
- `JOB_RETENTION_DAYS` - how long to keep job metadata (default: 7; 0 = cleanup disabled)

**GPU Health Monitoring & Auto-Reset:**

- `GPU_HEALTH_CHECK_ENABLED` - enable periodic background GPU monitoring (default: true)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - model used for health checks (default: "tiny")
- `GPU_RESET_COOLDOWN_MINUTES` - minimum time between GPU reset attempts (default: 5); auto-reset is **enabled by default**, and the service terminates if the GPU is still unavailable after a reset attempt

**API Server Specific:**

- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

## Supported Configurations

- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /

Returns API information and available endpoints.

### GET /health

Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA is available).

### POST /transcribe

Transcribe a single audio file that already exists on the server.

**Request Body:**

```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**

```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```

### POST /batch-transcribe

Batch transcribe all audio files in a folder.

**Request Body:**

```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**

```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**

- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.

### API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
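
The same calls from Python, using `requests`. This is a sketch against the endpoints documented above; the upload example writes the returned transcription file to disk.

```python
import requests

BASE = "http://localhost:8000"

# Transcribe a file that already exists on the server (synchronous endpoint)
resp = requests.post(
    f"{BASE}/transcribe",
    json={"audio_path": "/path/to/audio.mp3", "output_format": "txt"},
)
print(resp.json())  # {"success": true, "message": "...", "output_path": "..."}

# Upload a local file and save the transcription returned as a download
with open("audio.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE}/upload-transcribe",
        files={"file": f},
        data={"output_format": "txt", "model_name": "large-v3"},
    )
with open("audio.txt", "wb") as out:
    out.write(resp.content)
```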

## GPU Auto-Reset Configuration

### Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is **enabled by default** and includes cooldown protection to prevent reset loops.

### How It Works

1. **Startup check**: when the service starts, it performs a GPU health check
   - If CUDA errors are detected → automatic reset attempt → retry
   - If the retry fails → the service terminates

2. **Runtime check**: before job submission and model loading
   - If CUDA errors are detected → automatic reset attempt → retry
   - If the retry fails → the job is rejected and the service continues

3. **Cooldown protection**: prevents reset loops
   - Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
   - The cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
   - If a reset is needed but the cooldown is active → the service/job fails immediately
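
The cooldown rule reduces to a simple time comparison. The sketch below assumes the last-reset time can be read from the marker file's modification time; the actual format used by `src/core/gpu_reset.py` may differ.

```python
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_MINUTES = float(os.environ.get("GPU_RESET_COOLDOWN_MINUTES", "5"))

def reset_allowed() -> bool:
    """Return True if enough time has passed since the last GPU reset (illustrative)."""
    if not os.path.exists(COOLDOWN_FILE):
        return True  # never reset before
    elapsed = time.time() - os.path.getmtime(COOLDOWN_FILE)  # assumption: mtime marks the last reset
    return elapsed >= COOLDOWN_MINUTES * 60
```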

### Manual GPU Reset

You can manually reset the GPU at any time:

```bash
./reset_gpu.sh
```

Or clear the cooldown to allow an immediate reset:

```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```

### Behavior Examples

**After sleep/wake with a GPU issue:**

```
Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues
```

**Multiple failures (hardware issue):**

```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```

**Normal operation:**

```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```

## Important Implementation Details

### GPU-Only Architecture

- **CRITICAL**: the service enforces GPU-only mode; the CPU device is explicitly rejected in model_manager.py
- `device="auto"` resolution checks `torch.cuda.is_available()` and raises a RuntimeError if it returns False - there is no silent CPU fallback anywhere in the codebase
- GPU health checks load a real model and run a transcription, not just `torch.cuda.is_available()`, and verify the model actually ran on the GPU (via `torch.cuda.memory_allocated`)
- If the GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- **GPU Auto-Reset**: automatic driver reset on CUDA errors with a 5-minute cooldown (handles sleep/wake issues)

### Model Management

- GPU memory is checked before loading models
- Batch size adjusts dynamically to GPU memory: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 otherwise - see the sketch below
- Models are cached globally in the `model_instances` dict and shared across requests
- Model loading includes a GPU driver test so the service fails fast if the GPU is unavailable
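
The batch-size tiers written out as code - a sketch of the selection logic rather than the exact model_manager.py implementation:

```python
import torch

def pick_batch_size(device_index: int = 0) -> int:
    """Map total GPU memory to the batch-size tiers documented above (illustrative)."""
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    if total_gb > 16:
        return 32
    if total_gb > 12:
        return 16
    if total_gb > 8:
        return 8
    if total_gb > 4:
        return 4
    return 2
```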

### Thread Safety

- `JobQueue` uses a `threading.Lock` for job-dictionary access
- The worker thread pulls jobs from a `queue.Queue` (a thread-safe FIFO)
- `HealthMonitor` runs in a separate daemon thread

### Transcription Settings

- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy
- Word timestamps are enabled by default
- Files over 1GB generate a warning about processing time
- The default output format is "txt" for the REST API; the MCP server's default comes from environment variables
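
For reference, those defaults correspond roughly to the following plain faster-whisper call (illustrative; the real call is assembled in `src/core/transcriber.py` from the environment-variable defaults):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "/path/to/audio.mp3",
    beam_size=5,           # TRANSCRIPTION_BEAM_SIZE
    temperature=0.0,       # TRANSCRIPTION_TEMPERATURE
    language=None,         # auto-detect unless TRANSCRIPTION_LANGUAGE is set
    vad_filter=True,       # VAD enabled by default for long audio
    word_timestamps=True,  # word timestamps enabled by default
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```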

### Error Handling

- The circuit breaker prevents retry storms on GPU failures
- Input validation rejects invalid audio files, model names, and languages
- Job errors are captured and stored in the job metadata with status=FAILED

### Async Job Queue

- A single worker thread processes jobs sequentially (prevents GPU memory contention; see the sketch below)
- Jobs persist to disk as JSON files in `JOB_METADATA_DIR`
- The queue has a max size limit (default 100) and returns 503 when full
- Job status polling every 5-10 seconds is recommended for LLM agents

### Shutdown Handling

- `cleanup_on_shutdown()` waits for the current job to complete
- Stops the health monitor thread
- Saves final job states to disk
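
A minimal sketch of the single-worker pattern described under Async Job Queue above (`queue.Queue` plus one daemon thread). It shows the shape of the design, not the project's actual `JobQueue`; `run_transcription` is a hypothetical helper.

```python
import queue
import threading

jobs: "queue.Queue[dict]" = queue.Queue(maxsize=100)  # bounded, mirrors JOB_QUEUE_MAX_SIZE

def worker() -> None:
    while True:
        job = jobs.get()  # blocks until a job is available
        try:
            job["result"] = run_transcription(job)  # hypothetical helper; one job at a time on the GPU
            job["status"] = "completed"
        except Exception as exc:
            job["status"] = "failed"
            job["error"] = str(exc)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```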

## Development Workflow

### Common Development Tasks

**Adding a new output format** (see the sketch below):

1. Add a formatter function in `src/utils/formatters.py`
2. Add a case for it in `transcribe_audio()` in `src/core/transcriber.py`
3. Update the API docs and MCP tool descriptions
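
As a sketch of step 1, a hypothetical tab-separated formatter in the same shape as the existing ones (the real formatters live in `src/utils/formatters.py` and accept segment lists from Whisper output):

```python
def format_tsv(segments) -> str:
    """Hypothetical example formatter: start, end, and text per segment, tab-separated."""
    lines = ["start\tend\ttext"]
    for segment in segments:
        lines.append(f"{segment.start:.2f}\t{segment.end:.2f}\t{segment.text.strip()}")
    return "\n".join(lines) + "\n"
```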

**Adjusting GPU health check behavior**:

1. Modify the circuit breaker parameters in `src/core/gpu_health.py`
2. Adjust the health check interval via environment variables
3. Consider the cooldown timing in `src/core/gpu_reset.py`

**Testing GPU reset logic**:

1. Manually trigger a GPU failure (e.g., occupy all GPU memory)
2. Watch the logs for circuit breaker state transitions
3. Verify the reset attempt respects cooldown enforcement
4. Check `nvidia-smi` before/after the reset

**Debugging job queue issues**:

1. Check the job metadata files in `JOB_METADATA_DIR`
2. Look for lock contention in the logs
3. Verify the worker thread is running (check the logs for "Job queue worker started")
4. Test with `JOB_QUEUE_MAX_SIZE=1` to isolate serialization issues

### Running Tests

The test suite requires GPU access - ensure CUDA is properly configured and `PYTHONPATH` includes `src/` before running the test commands listed under Testing above.

Tests will automatically:

- Check for GPU availability (and exit if it is not available)
- Validate audio file processing
- Test GPU health monitoring
- Test job queue operations
- Test the transcription pipeline

### Testing GPU Health

```python
# Test the GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```

### Testing Job Queue

```python
# Test the job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit a job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get the result when completed
result = queue.get_job_result(job_info['job_id'])
```

### Common Debugging

**Model loading issues:**

- Check that `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check the logs for GPU driver test failures from model_manager.py

**GPU not detected:**

- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check that `torch.cuda.is_available()` returns True
- Run the GPU health check to see the detailed error

**Silent failures:**

- Confirm the service is NOT silently falling back to CPU
- The GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU

**Job queue issues:**

- Check that `JOB_METADATA_DIR` exists and is writable
- Verify the background worker thread is running (check the logs)
- Job metadata files live at `{JOB_METADATA_DIR}/{job_id}.json`

### File Locations

- **Logs**: `mcp.logs` (MCP server), `api.logs` (API server)
- **Models**: `$WHISPER_MODEL_DIR` or the HuggingFace cache
- **Outputs**: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- **Job metadata**: `$JOB_METADATA_DIR/{job_id}.json`

### Important Development Notes

- See `DEV_PLAN.md` for the detailed architecture and implementation plan for the async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time can exceed 30 seconds