# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:

1. **MCP Server** - For integration with Claude Desktop and other MCP clients
2. **REST API Server** - For HTTP-based integrations with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports.

**Key Features:**
- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- **Automatic GPU driver reset** on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory

## Development Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```

### Running the Servers

#### MCP Server (for Claude Desktop)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution
python whisper_server.py

# Using MCP CLI for development testing
mcp dev whisper_server.py

# Run server with MCP CLI
mcp run whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn
python api_server.py

# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```
## Architecture

### Directory Structure

```
.
├── src/                            # Source code directory
│   ├── servers/                    # Server implementations
│   │   ├── whisper_server.py       # MCP server entry point
│   │   └── api_server.py           # REST API server (async job queue)
│   ├── core/                       # Core business logic
│   │   ├── transcriber.py          # Transcription logic (single & batch)
│   │   ├── model_manager.py        # Model lifecycle & caching
│   │   ├── job_queue.py            # Async job queue manager
│   │   └── gpu_health.py           # GPU health monitoring
│   └── utils/                      # Utility modules
│       ├── audio_processor.py      # Audio validation & preprocessing
│       ├── formatters.py           # Output format conversion
│       └── test_audio_generator.py # Test audio generation for GPU checks
├── run_mcp_server.sh               # MCP server startup script
├── run_api_server.sh               # API server startup script
├── reset_gpu.sh                    # GPU driver reset script
├── DEV_PLAN.md                     # Development plan for async features
├── requirements.txt                # Python dependencies
└── pyproject.toml                  # Project configuration
```

### Core Components

1. **src/servers/whisper_server.py** - MCP server entry point
   - Uses FastMCP framework to expose MCP tools
   - Three main tools: `get_model_info_api()`, `transcribe()`, `batch_transcribe_audio()`
   - Server initialization at line 19

2. **src/servers/api_server.py** - REST API server entry point
   - Uses FastAPI framework for HTTP endpoints
   - Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
   - Shares core transcription logic with MCP server
   - File upload support via multipart/form-data

3. **src/core/transcriber.py** - Core transcription logic (shared by both servers)
   - `transcribe_audio()`:39 - Single file transcription with environment variable support
   - `batch_transcribe()`:209 - Batch processing with progress reporting
   - All parameters support environment variable defaults (lines 21-37)
   - Delegates output formatting to utils.formatters

4. **src/core/model_manager.py** - Whisper model lifecycle management
   - `get_whisper_model()`:44 - Returns cached model instances or loads new ones
   - `test_gpu_driver()`:20 - GPU validation before model loading
   - **CRITICAL**: GPU-only mode enforced at lines 64-90 (no CPU fallback)
   - Global `model_instances` dict caches loaded models to prevent reloading
   - Automatic batch size optimization based on GPU memory (lines 134-147)

5. **src/core/job_queue.py** - Async job queue manager
   - `JobQueue` class manages FIFO queue with background worker thread
   - `submit_job()` - Validates audio, checks GPU health, adds to queue
   - `get_job_status()` - Returns current job status and queue position
   - `get_job_result()` - Returns transcription result for completed jobs
   - Jobs persist to disk as JSON files for crash recovery
   - Single worker thread processes jobs sequentially (prevents GPU contention)

6. **src/core/gpu_health.py** - GPU health monitoring
   - `check_gpu_health()`:39 - Real GPU test using tiny model + test audio
   - `GPUHealthStatus` dataclass contains detailed GPU metrics
   - **CRITICAL**: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
   - Prevents silent CPU fallback that would cause 10-100x slowdown
   - `HealthMonitor` class for periodic background monitoring

7. **src/utils/audio_processor.py** - Audio file validation and preprocessing
   - `validate_audio_file()`:15 - Checks file existence, format, and size
   - `process_audio()`:50 - Decodes audio using faster_whisper's decode_audio

8. **src/utils/formatters.py** - Output format conversion
   - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - Convert segments to various formats
   - All formatters accept segment lists from Whisper output

9. **src/utils/test_audio_generator.py** - Test audio generation
   - `generate_test_audio()` - Creates synthetic 1-second audio for GPU health checks
   - Uses numpy to generate sine wave, no external audio files needed
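The utility in item 9 is small enough to illustrate inline. Below is a minimal sketch of the kind of clip `generate_test_audio()` produces; the 16 kHz sample rate, 440 Hz tone, and function name are assumptions for illustration, not the actual implementation:

```python
# Sketch of synthetic test-audio generation (illustrative; the real helper
# lives in src/utils/test_audio_generator.py and may use different parameters).
import numpy as np

def generate_sine_test_clip(duration_s: float = 1.0,
                            sample_rate: int = 16000,
                            frequency_hz: float = 440.0) -> np.ndarray:
    """Return a short mono sine wave as float32 samples in [-1, 1]."""
    t = np.linspace(0.0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return (0.5 * np.sin(2.0 * np.pi * frequency_hz * t)).astype(np.float32)
```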
### Key Architecture Patterns

- **Dual Server Architecture**: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: Models are cached in the `model_instances` dictionary with key format `{model_name}_{device}_{compute_type}` (src/core/model_manager.py:104). This cache is shared if both servers run in the same process
- **Batch Processing**: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
- **Environment Variable Configuration**: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
- **GPU-Only Mode**: Service is configured for GPU-only operation. `device="auto"` requires CUDA, `device="cpu"` is rejected (src/core/model_manager.py:64-90)
- **Async Job Queue**: Long-running transcriptions use an async queue pattern to prevent HTTP timeouts. Jobs return immediately with a job_id for polling
- **GPU Health Monitoring**: Real GPU tests with the tiny model prevent silent CPU fallback. Jobs are rejected immediately if the GPU fails rather than running 10-100x slower on CPU

## Environment Variables

All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:

**API Server Specific:**
- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

**Job Queue Configuration (if using async features):**
- `JOB_QUEUE_MAX_SIZE` - Maximum queue size (default: 100)
- `JOB_METADATA_DIR` - Directory for job metadata JSON files
- `JOB_RETENTION_DAYS` - Auto-cleanup old jobs (0=disabled)

**GPU Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED` - Enable periodic GPU monitoring (true/false)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - Monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - Model for health checks (default: tiny)

**GPU Auto-Reset Configuration:**
- `GPU_RESET_COOLDOWN_MINUTES` - Minimum time between GPU reset attempts (default: 5 minutes)
- Prevents reset loops while allowing recovery from sleep/wake cycles
- Auto-reset is **enabled by default**
- Service terminates if GPU unavailable after reset attempt

**Transcription Configuration (shared by both servers):**
- `CUDA_VISIBLE_DEVICES` - GPU device selection
- `WHISPER_MODEL_DIR` - Model storage location (defaults to None for HuggingFace cache)
- `TRANSCRIPTION_OUTPUT_DIR` - Default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - Default output directory for batch processing
- `TRANSCRIPTION_MODEL` - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- `TRANSCRIPTION_DEVICE` - Execution device (cuda, auto) - **NOTE: cpu is rejected in GPU-only mode**
- `TRANSCRIPTION_COMPUTE_TYPE` - Computation type (float16, int8, auto)
- `TRANSCRIPTION_OUTPUT_FORMAT` - Output format (vtt, srt, txt, json)
- `TRANSCRIPTION_BEAM_SIZE` - Beam search size (default: 5)
- `TRANSCRIPTION_TEMPERATURE` - Sampling temperature (default: 0.0)
- `TRANSCRIPTION_USE_TIMESTAMP` - Add timestamp to filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - Prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - Suffix for output filenames
- `TRANSCRIPTION_LANGUAGE` - Language code (zh, en, ja, etc., auto-detect if not set)
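The transcription variables above are read as parameter defaults in src/core/transcriber.py (lines 21-37). A sketch of that pattern follows; the default values shown are taken from this file's documentation and may not match the source exactly:

```python
# Sketch of the env-var default pattern used for transcription parameters
# (illustrative; see src/core/transcriber.py:21-37 for the real defaults).
import os

DEFAULT_MODEL = os.getenv("TRANSCRIPTION_MODEL", "large-v3")
DEFAULT_DEVICE = os.getenv("TRANSCRIPTION_DEVICE", "auto")
DEFAULT_COMPUTE_TYPE = os.getenv("TRANSCRIPTION_COMPUTE_TYPE", "auto")
DEFAULT_OUTPUT_FORMAT = os.getenv("TRANSCRIPTION_OUTPUT_FORMAT", "txt")
DEFAULT_BEAM_SIZE = int(os.getenv("TRANSCRIPTION_BEAM_SIZE", "5"))
DEFAULT_TEMPERATURE = float(os.getenv("TRANSCRIPTION_TEMPERATURE", "0.0"))
DEFAULT_LANGUAGE = os.getenv("TRANSCRIPTION_LANGUAGE")  # None -> auto-detect
```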
## Supported Configurations

- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /
Returns API information and available endpoints.

### GET /health
Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).

### POST /transcribe
Transcribe a single audio file that exists on the server.

**Request Body:**
```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**
```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```

### POST /batch-transcribe
Batch transcribe all audio files in a folder.

**Request Body:**
```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**
```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**
- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.

### API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
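For clients that prefer Python over curl, here is a minimal polling sketch against the async job endpoints shown above. It assumes the `requests` library and the response fields from the curl examples; the terminal status values ("completed", "failed") are assumptions, not confirmed API values:

```python
# Submit an async transcription job and poll until it finishes (sketch).
import time
import requests

BASE_URL = "http://localhost:8000"

job = requests.post(f"{BASE_URL}/jobs",
                    json={"audio_path": "/path/to/audio.mp3"}).json()
job_id = job["job_id"]
print("Submitted:", job_id, "status:", job["status"])

# Poll every 5-10 seconds, as recommended for long-running transcriptions.
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if status["status"] in ("completed", "failed"):  # assumed terminal states
        break
    time.sleep(5)

if status["status"] == "completed":
    result = requests.get(f"{BASE_URL}/jobs/{job_id}/result")
    print(result.text)
else:
    print("Job failed:", status)
```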
## GPU Auto-Reset Configuration

### Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is **enabled by default** and includes cooldown protection to prevent reset loops.

### Passwordless Sudo Setup (Required)

For automatic GPU reset to work, you must configure passwordless sudo for NVIDIA commands.

Create a sudoers configuration file:

```bash
sudo visudo -f /etc/sudoers.d/whisper-gpu-reset
```

Add the following (replace `your_username` with your actual username):

```
# Whisper GPU Auto-Reset Permissions
your_username ALL=(ALL) NOPASSWD: /bin/systemctl stop nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /bin/systemctl start nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_drm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_drm
```

**Security Note:** These permissions are limited to specific NVIDIA driver commands only. The reset script (`reset_gpu.sh`) is executed with sudo but is part of the codebase and can be audited.

### How It Works

1. **Startup Check**: When the service starts, it performs a GPU health check
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → service terminates

2. **Runtime Check**: Before job submission and model loading
   - If CUDA errors detected → automatic reset attempt → retry
   - If retry fails → job rejected, service continues

3. **Cooldown Protection**: Prevents reset loops (see the sketch at the end of this section)
   - Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
   - Cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
   - If reset needed but cooldown active → service/job fails immediately

### Manual GPU Reset

You can manually reset the GPU at any time:

```bash
./reset_gpu.sh
```

Or clear the cooldown to allow an immediate reset:

```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```

### Behavior Examples

**After sleep/wake with GPU issue:**
```
Service starts → GPU check fails (CUDA error) → Cooldown OK
→ Reset drivers → Wait 3s → Retry → Success → Service continues
```

**Multiple failures (hardware issue):**
```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```

**Normal operation:**
```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```
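The cooldown protection described under "How It Works" amounts to comparing the timestamp recorded in `/tmp/whisper-gpu-last-reset` against `GPU_RESET_COOLDOWN_MINUTES`. Below is a sketch of that check, assuming the file's modification time serves as the marker; the real logic in `core.gpu_reset` may store and read the timestamp differently:

```python
# Sketch of the reset-cooldown check (illustrative; see core.gpu_reset for
# the actual implementation).
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_MINUTES = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5"))

def reset_allowed() -> bool:
    """True if no reset has been attempted within the cooldown window."""
    try:
        last_reset = os.path.getmtime(COOLDOWN_FILE)
    except FileNotFoundError:
        return True  # no previous reset recorded
    return (time.time() - last_reset) >= COOLDOWN_MINUTES * 60

def record_reset_attempt() -> None:
    """Persist the attempt so the cooldown survives service restarts."""
    with open(COOLDOWN_FILE, "w") as fh:
        fh.write(str(int(time.time())))
```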
## Important Implementation Details

### GPU-Only Architecture

- **CRITICAL**: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
- `device="auto"` requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
- GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
- If the GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- **GPU Auto-Reset**: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)

### Model Management

- GPU memory is checked before loading models (src/core/model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise)
- Models are cached globally in `model_instances` dict, shared across requests
- Model loading includes a GPU driver test to fail fast if the GPU is unavailable (src/core/model_manager.py:112-114)

### Transcription Settings

- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
- Word timestamps are enabled by default (src/core/transcriber.py:107)
- Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
- Default output format is "txt" for the REST API, configured via environment variables for the MCP server

### Async Job Queue (if enabled)

- Single worker thread processes jobs sequentially (prevents GPU memory contention)
- Jobs persist to disk as JSON files in JOB_METADATA_DIR
- Queue has a max size limit (default 100), returns 503 when full
- Job status polling recommended every 5-10 seconds for LLM agents

## Development Workflow

### Testing GPU Health

```python
# Test GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```

### Testing Job Queue

```python
# Test job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get result when completed
result = queue.get_job_result(job_info['job_id'])
```

### Common Debugging

**Model loading issues:**
- Check `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check logs for GPU driver test failures at model_manager.py:112-114

**GPU not detected** (see the diagnostic snippet at the end of this section):
- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check `torch.cuda.is_available()` returns True
- Run the GPU health check to see the detailed error

**Silent failures:**
- Check that the service is NOT silently falling back to CPU
- GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU

**Job queue issues:**
- Check `JOB_METADATA_DIR` exists and is writable
- Verify background worker thread is running (check logs)
- Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json
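A quick diagnostic for the "Model loading issues" and "GPU not detected" checks above. The batch-size tiers mirror the values listed under Model Management; this is a sketch using standard torch calls, not the service's own code:

```python
# Quick GPU diagnostic (run inside the service's virtualenv).
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available():", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gb:.1f} GB total")

    # Batch-size tiers from the Model Management notes above.
    for threshold_gb, batch_size in [(16, 32), (12, 16), (8, 8), (4, 4)]:
        if total_gb > threshold_gb:
            print("Expected batch size:", batch_size)
            break
    else:
        print("Expected batch size:", 2)
```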
### File Locations

- **Logs**: `mcp.logs` (MCP server), `api.logs` (API server)
- **Models**: `$WHISPER_MODEL_DIR` or HuggingFace cache
- **Outputs**: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- **Job Metadata**: `$JOB_METADATA_DIR/{job_id}.json`

### Important Development Notes

- See `DEV_PLAN.md` for the detailed architecture and implementation plan for the async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time >30 seconds (see the sketch below)
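The async pattern referenced in the last note follows the job-queue flow documented above: validate the request, enqueue a job, and return a `job_id` immediately instead of blocking until transcription finishes. Below is a sketch of such an endpoint that reuses the `JobQueue` calls shown under "Testing Job Queue"; the request model, route, and error handling are illustrative and not the actual api_server.py code:

```python
# Sketch of an async-pattern endpoint for long-running transcriptions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from src.core.job_queue import JobQueue

app = FastAPI()
queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

class TranscribeJobRequest(BaseModel):
    audio_path: str
    model_name: str = "large-v3"
    device: str = "auto"

@app.post("/jobs")
def submit_transcription_job(request: TranscribeJobRequest):
    """Enqueue the job and return immediately; clients poll /jobs/{job_id}."""
    try:
        job_info = queue.submit_job(
            audio_path=request.audio_path,
            model_name=request.model_name,
            device=request.device,
        )
    except Exception as exc:  # e.g. queue full or GPU health failure
        raise HTTPException(status_code=503, detail=str(exc))
    return {"job_id": job_info["job_id"], "status": "queued"}
```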