update claude md
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**fast-whisper-mcp-server** is a high-performance audio transcription service built on faster-whisper with a dual-server architecture:

1. **MCP Server** (`whisper_server.py`) - Model Context Protocol interface for LLM integration (Claude Desktop and other MCP clients)
2. **REST API Server** (`api_server.py`) - HTTP REST endpoints built on FastAPI, with async job queue support

Both servers share the same core transcription logic and can run independently or simultaneously on different ports. The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker patterns, and comprehensive error handling. **GPU is required** - there is no CPU fallback.

**Key Features:**

- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- **Automatic GPU driver reset** on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory

## Core Commands

### Environment Setup

```bash
# Create and activate virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt

# Install PyTorch with the wheel index that matches your CUDA version
# CUDA 12.6:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.1:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

### Running the Servers

Both servers log to `mcp.logs` and `api.logs` respectively.

#### MCP Server (for Claude Desktop and other MCP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh

# Direct Python execution (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
python src/servers/whisper_server.py

# Using the MCP CLI for development testing
mcp dev src/servers/whisper_server.py
```

#### REST API Server (for HTTP clients)

```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh

# Direct Python execution with uvicorn (ensure PYTHONPATH includes src/)
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"
uvicorn src.servers.api_server:app --host 0.0.0.0 --port 8000

# Development mode with auto-reload
uvicorn src.servers.api_server:app --reload --host 0.0.0.0 --port 8000
```

#### Running Both Simultaneously

```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh

# Terminal 2: Start REST API server
./run_api_server.sh
```

### Testing

**Important**: all tests require a working GPU and will fail if CUDA is not properly configured.

```bash
# Ensure the src/ directory is on PYTHONPATH
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"

# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py

# Run end-to-end integration tests
python tests/test_e2e_integration.py

# Run async API integration tests
python tests/test_async_api_integration.py

# Or run individual test components from the tests directory
cd tests && python test_core_components.py
```

### GPU Management

```bash
# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh

# Check GPU status
nvidia-smi

# Monitor GPU during transcription
watch -n 1 nvidia-smi
```

### Docker

```bash
# Build Docker image
docker build -t whisper-mcp-server .

# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
```

## Architecture

### Directory Structure

```
.
├── src/                              # Source code
│   ├── servers/                      # Server implementations
│   │   ├── whisper_server.py         # MCP server (stdio transport)
│   │   └── api_server.py             # FastAPI REST server (async job queue)
│   ├── core/                         # Core business logic
│   │   ├── transcriber.py            # Transcription logic (single & batch) with env var defaults
│   │   ├── model_manager.py          # Whisper model loading/caching (GPU-only)
│   │   ├── job_queue.py              # Async FIFO job queue with worker thread
│   │   ├── gpu_health.py             # Real GPU health checks with circuit breaker
│   │   └── gpu_reset.py              # Automatic GPU driver reset with cooldown
│   └── utils/                        # Utility modules
│       ├── startup.py                # Common startup sequence (GPU check, initialization)
│       ├── circuit_breaker.py        # Circuit breaker pattern implementation
│       ├── audio_processor.py        # Audio validation & preprocessing
│       ├── formatters.py             # Output format handlers (txt, vtt, srt, json)
│       ├── input_validation.py       # Input validation utilities
│       └── test_audio_generator.py   # Generate test audio for GPU health checks
├── tests/                            # Test suite (requires GPU)
│   ├── test_core_components.py       # Core functionality tests
│   ├── test_e2e_integration.py       # End-to-end integration tests
│   └── test_async_api_integration.py # Async API tests
├── run_mcp_server.sh                 # MCP server startup script
├── run_api_server.sh                 # API server startup script
├── reset_gpu.sh                      # GPU driver reset script
├── DEV_PLAN.md                       # Development plan for async features
├── requirements.txt                  # Python dependencies
└── pyproject.toml                    # Project configuration
```

### Core Components

1. **src/servers/whisper_server.py** - MCP server entry point
   - Uses the FastMCP framework to expose MCP tools over stdio transport
   - Main tools: `get_model_info_api()`, `transcribe_async()`, `transcribe_upload()`, `check_job_status()`, `get_job_result()`
   - Holds the global job queue and health monitor instances

2. **src/servers/api_server.py** - REST API server entry point
   - Uses FastAPI for the HTTP endpoints
   - Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
   - Shares the core transcription logic with the MCP server
   - Supports file upload via multipart/form-data

3. **src/core/transcriber.py** - Core transcription logic (shared by both servers)
   - `transcribe_audio()` - single-file transcription with environment variable support
   - `batch_transcribe()` - batch processing with progress reporting
   - All parameters support environment variable defaults
   - Delegates output formatting to utils.formatters

4. **src/core/model_manager.py** - Whisper model lifecycle management
   - `get_whisper_model()` - returns cached model instances or loads new ones
   - `test_gpu_driver()` - GPU validation before model loading
   - **CRITICAL**: GPU-only mode is enforced here - there is no CPU fallback
   - The global `model_instances` dict caches loaded models to prevent reloading
   - Automatic batch size optimization based on GPU memory

5. **src/core/job_queue.py** - Async job queue manager
   - The `JobQueue` class manages a FIFO queue with a background worker thread
   - `submit_job()` - validates the audio, checks GPU health, adds the job to the queue
   - `get_job_status()` - returns the current job status and queue position
   - `get_job_result()` - returns the transcription result for completed jobs
   - Jobs persist to disk as JSON files for crash recovery
   - A single worker thread processes jobs sequentially (prevents GPU contention)

6. **src/core/gpu_health.py** - GPU health monitoring
   - `check_gpu_health()` - real GPU test using the tiny model plus generated test audio
   - The `GPUHealthStatus` dataclass contains detailed GPU metrics
   - **CRITICAL**: raises RuntimeError if device=cuda but the GPU fails
   - Prevents the silent CPU fallback that would cause a 10-100x slowdown
   - `HealthMonitor` class for periodic background monitoring

7. **src/utils/audio_processor.py** - Audio file validation and preprocessing
   - `validate_audio_file()` - checks file existence, format, and size
   - `process_audio()` - decodes audio using faster_whisper's decode_audio

8. **src/utils/formatters.py** - Output format conversion
   - `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - convert segments to the various output formats
   - All formatters accept segment lists from Whisper output

9. **src/utils/test_audio_generator.py** - Test audio generation
   - `generate_test_audio()` - creates synthetic 1-second audio for GPU health checks
   - Uses numpy to generate a sine wave; no external audio files needed

10. **src/core/gpu_reset.py** - GPU driver reset with cooldown protection
    - `reset_gpu_driver()` - executes reset_gpu.sh to reload the NVIDIA kernel modules (requires sudo)
    - `check_reset_cooldown()` - validates that enough time has passed since the last reset
    - The cooldown timestamp persists in `/tmp/whisper-gpu-last-reset`
    - Prevents reset loops while still allowing recovery from sleep/wake issues

11. **src/utils/startup.py** - Startup sequence orchestration
    - `startup_sequence()` - coordinates the GPU health check and queue initialization
    - `cleanup_on_shutdown()` - cleanup handler for graceful shutdown
    - Centralizes startup logic shared by both servers

12. **src/utils/circuit_breaker.py** - Circuit breaker pattern implementation
    - Provides fault tolerance for GPU health checks and model operations
    - Prevents cascading failures

13. **src/utils/input_validation.py** - Input validation utilities
    - Validates and sanitizes user inputs
    - Security layer for the API endpoints

### Key Architectural Patterns

- **Dual-Server Architecture**: both the MCP and REST API servers import the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: models are cached in the `model_instances` dictionary with the key format `{model_name}_{device}_{compute_type}`; the cache is shared if both servers run in the same process
- **Batch Processing**: CUDA devices automatically use `BatchedInferencePipeline` for performance
- **Environment Variable Configuration**: all transcription parameters support environment variable defaults (see Environment Variables below)
- **GPU-Only Mode**: `device="auto"` requires CUDA and `device="cpu"` is rejected; if the GPU health check fails, jobs are rejected immediately rather than running 10-100x slower on CPU

**Async Job Queue** (`job_queue.py`):

- FIFO queue with a background worker thread; long-running transcriptions return a `job_id` immediately for polling, preventing HTTP timeouts
- Disk persistence of job metadata to `JOB_METADATA_DIR`
- States: QUEUED → RUNNING → COMPLETED/FAILED
- Jobs include the full request params plus results
- Thread-safe operations with locks

**GPU Health Monitoring** (`gpu_health.py`):

- Performs **real** GPU checks: loads the tiny model and transcribes test audio
- A circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
- Integrates with GPU auto-reset on failures
- Background monitoring thread in the `HealthMonitor` class
- Never falls back to CPU - raises RuntimeError if the GPU is unavailable

**GPU Auto-Reset** (`gpu_reset.py`):

- Automatically resets the GPU drivers via `reset_gpu.sh` when health checks fail
- Cooldown mechanism (default 5 minutes via `GPU_RESET_COOLDOWN_MINUTES`)
- Requires sudo - the script unloads and reloads the nvidia kernel modules
- Integrated with the circuit breaker to avoid reset loops

**Startup Sequence** (`startup.py`):

- Common startup logic for both servers
- Phase 1: GPU health check with optional auto-reset
- Phase 2: initialize the job queue
- Phase 3: initialize the health monitor (background thread)
- Exits on GPU failure unless configured otherwise

**Circuit Breaker** (`circuit_breaker.py`):

- States: CLOSED → OPEN → HALF_OPEN → CLOSED
- Configurable failure/success thresholds
- Prevents cascading failures and resource exhaustion
- Used for GPU health checks and model operations
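
To make the circuit breaker behavior concrete, here is a minimal sketch of the CLOSED → OPEN → HALF_OPEN state machine described above. It is illustrative only - the class name, thresholds, and method names are assumptions, not the actual code in `src/utils/circuit_breaker.py`.

```python
import time

class SimpleCircuitBreaker:
    """Illustrative circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a HALF_OPEN trial
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("Circuit open - refusing call")
            self.state = "HALF_OPEN"  # allow a single trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```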

### API Workflow (Async Jobs)

Both the MCP and REST API servers use the same async workflow:

1. **Submit job**: `transcribe_async()` returns a `job_id` immediately
2. **Poll status**: `get_job_status(job_id)` returns the status plus `queue_position`
3. **Get result**: when status is "completed", `get_job_result(job_id)` returns the transcription

The job queue processes one job at a time in a background worker thread.
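
For an HTTP client, the submit/poll/result loop looks roughly like this. This is a sketch assuming the async REST endpoints shown later under API Usage Examples (`POST /jobs`, `GET /jobs/{job_id}`, `GET /jobs/{job_id}/result`); the exact response fields may differ.

```python
import time
import requests

BASE = "http://localhost:8000"  # REST API server

def transcribe_async_http(audio_path: str, poll_seconds: float = 5.0) -> str:
    """Submit a transcription job and poll until it finishes (illustrative sketch)."""
    job = requests.post(f"{BASE}/jobs", json={"audio_path": audio_path}).json()
    job_id = job["job_id"]

    while True:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()
        if status["status"] in ("completed", "failed"):
            break
        time.sleep(poll_seconds)  # 5-10 second polling is recommended

    if status["status"] == "failed":
        raise RuntimeError(f"Job {job_id} failed: {status}")
    return requests.get(f"{BASE}/jobs/{job_id}/result").text
```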

### Model Loading Strategy

- Models are cached in the `model_instances` dict (key: model name + device + compute type)
- The first load downloads the model to `WHISPER_MODEL_DIR` (or the default HuggingFace cache)
- A GPU health check runs on model load and may trigger an auto-reset if the GPU fails
- There is no CPU fallback - a `RuntimeError` is raised if CUDA is unavailable
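
A minimal sketch of that caching pattern - one `WhisperModel` per (model, device, compute_type) key, using the key format documented above. The helper name is hypothetical; the real logic lives in `src/core/model_manager.py` and additionally enforces the GPU-only checks.

```python
from faster_whisper import WhisperModel

model_instances: dict = {}  # global cache, shared across requests

def get_cached_model(model_name: str, device: str, compute_type: str) -> WhisperModel:
    key = f"{model_name}_{device}_{compute_type}"  # documented cache-key format
    if key not in model_instances:
        model_instances[key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return model_instances[key]
```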

## Environment Variables

All configuration is set via environment variables; both startup scripts (`run_mcp_server.sh`, `run_api_server.sh`) export extensive defaults. Key variables:

**GPU/CUDA:**

- `CUDA_VISIBLE_DEVICES` - GPU index to use (default: 1)
- `LD_LIBRARY_PATH` - CUDA library path
- `TRANSCRIPTION_DEVICE` - execution device: "cuda" or "auto" - **"cpu" is rejected in GPU-only mode**
- `TRANSCRIPTION_COMPUTE_TYPE` - computation type: "float16", "int8", or "auto"

**Paths:**

- `WHISPER_MODEL_DIR` - model storage location (defaults to the HuggingFace cache when unset)
- `TRANSCRIPTION_OUTPUT_DIR` - default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - default output directory for batch processing
- `JOB_METADATA_DIR` - directory for job metadata JSON files

**Transcription Defaults:**

- `TRANSCRIPTION_MODEL` - model size: tiny, base, small, medium, large-v1, large-v2, large-v3 (default: "large-v3")
- `TRANSCRIPTION_OUTPUT_FORMAT` - output format: "txt", "vtt", "srt", or "json"
- `TRANSCRIPTION_BEAM_SIZE` - beam search size (default: 5 for the API server, 2 for the MCP server)
- `TRANSCRIPTION_TEMPERATURE` - sampling temperature (default: 0.0)
- `TRANSCRIPTION_LANGUAGE` - language code (zh, en, ja, etc.; auto-detect if not set)
- `TRANSCRIPTION_USE_TIMESTAMP` - add a timestamp to output filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - suffix for output filenames

**Job Queue:**

- `JOB_QUEUE_MAX_SIZE` - maximum queued jobs (default: 100 for MCP, 5 for the API server)
- `JOB_RETENTION_DAYS` - how long to keep job metadata (default: 7; 0 = cleanup disabled)

**GPU Health Monitoring & Auto-Reset:**

- `GPU_HEALTH_CHECK_ENABLED` - enable periodic background GPU monitoring (default: true)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - model used for health checks (default: "tiny")
- `GPU_RESET_COOLDOWN_MINUTES` - minimum time between GPU reset attempts (default: 5); auto-reset is **enabled by default**, and the service terminates if the GPU is still unavailable after a reset attempt

**API Server Specific:**

- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)

## Supported Configurations

- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)

## REST API Endpoints

The REST API server provides the following HTTP endpoints:

### GET /

Returns API information and available endpoints.

### GET /health

Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.

### GET /models

Returns available Whisper models, devices, languages, and system information (GPU details if CUDA is available).

### POST /transcribe

Transcribe a single audio file that already exists on the server.

**Request Body:**

```json
{
  "audio_path": "/path/to/audio.mp3",
  "model_name": "large-v3",
  "device": "auto",
  "compute_type": "auto",
  "language": "en",
  "output_format": "txt",
  "beam_size": 5,
  "temperature": 0.0,
  "initial_prompt": null,
  "output_directory": null
}
```

**Response:**

```json
{
  "success": true,
  "message": "Transcription successful, results saved to: /path/to/output.txt",
  "output_path": "/path/to/output.txt"
}
```

### POST /batch-transcribe

Batch transcribe all audio files in a folder.

**Request Body:**

```json
{
  "audio_folder": "/path/to/audio/folder",
  "output_folder": "/path/to/output",
  "model_name": "large-v3",
  "output_format": "txt",
  ...
}
```

**Response:**

```json
{
  "success": true,
  "summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```

### POST /upload-transcribe

Upload an audio file and transcribe it immediately. Returns the transcription file as a download.

**Form Data:**

- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)

**Response:** Returns the transcription file for download.

### API Usage Examples

```bash
# Get model information
curl http://localhost:8000/models

# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'

# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
  -F "file=@audio.mp3" \
  -F "output_format=txt" \
  -F "model_name=large-v3"

# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}

# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}

# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text

# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
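
The same calls from Python, using `requests`. This is a sketch against the endpoints documented above; the upload example writes the returned transcription file to disk.

```python
import requests

BASE = "http://localhost:8000"

# Transcribe a file that already exists on the server (synchronous endpoint)
resp = requests.post(
    f"{BASE}/transcribe",
    json={"audio_path": "/path/to/audio.mp3", "output_format": "txt"},
)
print(resp.json())  # {"success": true, "message": "...", "output_path": "..."}

# Upload a local file and save the transcription returned as a download
with open("audio.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE}/upload-transcribe",
        files={"file": f},
        data={"output_format": "txt", "model_name": "large-v3"},
    )
with open("audio.txt", "wb") as out:
    out.write(resp.content)
```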

## GPU Auto-Reset Configuration

### Overview

This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is **enabled by default** and includes cooldown protection to prevent reset loops.

### How It Works

1. **Startup check**: when the service starts, it performs a GPU health check
   - If CUDA errors are detected → automatic reset attempt → retry
   - If the retry fails → the service terminates

2. **Runtime check**: before job submission and model loading
   - If CUDA errors are detected → automatic reset attempt → retry
   - If the retry fails → the job is rejected and the service continues

3. **Cooldown protection**: prevents reset loops
   - Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
   - The cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
   - If a reset is needed but the cooldown is active → the service/job fails immediately
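
The cooldown rule reduces to a simple time comparison. The sketch below assumes the last-reset time can be read from the marker file's modification time; the actual format used by `src/core/gpu_reset.py` may differ.

```python
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_MINUTES = float(os.environ.get("GPU_RESET_COOLDOWN_MINUTES", "5"))

def reset_allowed() -> bool:
    """Return True if enough time has passed since the last GPU reset (illustrative)."""
    if not os.path.exists(COOLDOWN_FILE):
        return True  # never reset before
    elapsed = time.time() - os.path.getmtime(COOLDOWN_FILE)  # assumption: mtime marks the last reset
    return elapsed >= COOLDOWN_MINUTES * 60
```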

### Manual GPU Reset

You can manually reset the GPU at any time:

```bash
./reset_gpu.sh
```

Or clear the cooldown to allow an immediate reset:

```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```

### Behavior Examples

**After sleep/wake with a GPU issue:**

```
Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues
```

**Multiple failures (hardware issue):**

```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```

**Normal operation:**

```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```

## Important Implementation Details

### GPU-Only Architecture

- **CRITICAL**: the service enforces GPU-only mode; the CPU device is explicitly rejected in model_manager.py
- `device="auto"` resolution checks `torch.cuda.is_available()` and raises a RuntimeError if it returns False - there is no silent CPU fallback anywhere in the codebase
- GPU health checks load a real model and run a transcription, not just `torch.cuda.is_available()`, and verify the model actually ran on the GPU (via `torch.cuda.memory_allocated`)
- If the GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- **GPU Auto-Reset**: automatic driver reset on CUDA errors with a 5-minute cooldown (handles sleep/wake issues)

### Model Management

- GPU memory is checked before loading models
- Batch size adjusts dynamically to GPU memory: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 otherwise - see the sketch below
- Models are cached globally in the `model_instances` dict and shared across requests
- Model loading includes a GPU driver test so the service fails fast if the GPU is unavailable
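
The batch-size tiers written out as code - a sketch of the selection logic rather than the exact model_manager.py implementation:

```python
import torch

def pick_batch_size(device_index: int = 0) -> int:
    """Map total GPU memory to the batch-size tiers documented above (illustrative)."""
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    if total_gb > 16:
        return 32
    if total_gb > 12:
        return 16
    if total_gb > 8:
        return 8
    if total_gb > 4:
        return 4
    return 2
```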

### Thread Safety

- `JobQueue` uses a `threading.Lock` for job-dictionary access
- The worker thread pulls jobs from a `queue.Queue` (a thread-safe FIFO)
- `HealthMonitor` runs in a separate daemon thread

### Transcription Settings

- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy
- Word timestamps are enabled by default
- Files over 1GB generate a warning about processing time
- The default output format is "txt" for the REST API; the MCP server's default comes from environment variables
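
For reference, those defaults correspond roughly to the following plain faster-whisper call (illustrative; the real call is assembled in `src/core/transcriber.py` from the environment-variable defaults):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "/path/to/audio.mp3",
    beam_size=5,           # TRANSCRIPTION_BEAM_SIZE
    temperature=0.0,       # TRANSCRIPTION_TEMPERATURE
    language=None,         # auto-detect unless TRANSCRIPTION_LANGUAGE is set
    vad_filter=True,       # VAD enabled by default for long audio
    word_timestamps=True,  # word timestamps enabled by default
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```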

### Error Handling

- The circuit breaker prevents retry storms on GPU failures
- Input validation rejects invalid audio files, model names, and languages
- Job errors are captured and stored in the job metadata with status=FAILED

### Async Job Queue

- A single worker thread processes jobs sequentially (prevents GPU memory contention; see the sketch below)
- Jobs persist to disk as JSON files in `JOB_METADATA_DIR`
- The queue has a max size limit (default 100) and returns 503 when full
- Job status polling every 5-10 seconds is recommended for LLM agents

### Shutdown Handling

- `cleanup_on_shutdown()` waits for the current job to complete
- Stops the health monitor thread
- Saves final job states to disk
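
A minimal sketch of the single-worker pattern described under Async Job Queue above (`queue.Queue` plus one daemon thread). It shows the shape of the design, not the project's actual `JobQueue`; `run_transcription` is a hypothetical helper.

```python
import queue
import threading

jobs: "queue.Queue[dict]" = queue.Queue(maxsize=100)  # bounded, mirrors JOB_QUEUE_MAX_SIZE

def worker() -> None:
    while True:
        job = jobs.get()  # blocks until a job is available
        try:
            job["result"] = run_transcription(job)  # hypothetical helper; one job at a time on the GPU
            job["status"] = "completed"
        except Exception as exc:
            job["status"] = "failed"
            job["error"] = str(exc)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```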

## Development Workflow

### Common Development Tasks

**Adding a new output format** (see the sketch below):

1. Add a formatter function in `src/utils/formatters.py`
2. Add a case for it in `transcribe_audio()` in `src/core/transcriber.py`
3. Update the API docs and MCP tool descriptions
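
As a sketch of step 1, a hypothetical tab-separated formatter in the same shape as the existing ones (the real formatters live in `src/utils/formatters.py` and accept segment lists from Whisper output):

```python
def format_tsv(segments) -> str:
    """Hypothetical example formatter: start, end, and text per segment, tab-separated."""
    lines = ["start\tend\ttext"]
    for segment in segments:
        lines.append(f"{segment.start:.2f}\t{segment.end:.2f}\t{segment.text.strip()}")
    return "\n".join(lines) + "\n"
```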

**Adjusting GPU health check behavior**:

1. Modify the circuit breaker parameters in `src/core/gpu_health.py`
2. Adjust the health check interval via environment variables
3. Consider the cooldown timing in `src/core/gpu_reset.py`

**Testing GPU reset logic**:

1. Manually trigger a GPU failure (e.g., occupy all GPU memory)
2. Watch the logs for circuit breaker state transitions
3. Verify the reset attempt respects cooldown enforcement
4. Check `nvidia-smi` before/after the reset

**Debugging job queue issues**:

1. Check the job metadata files in `JOB_METADATA_DIR`
2. Look for lock contention in the logs
3. Verify the worker thread is running (check the logs for "Job queue worker started")
4. Test with `JOB_QUEUE_MAX_SIZE=1` to isolate serialization issues

### Running Tests

The test suite requires GPU access - ensure CUDA is properly configured and `PYTHONPATH` includes `src/` before running the test commands listed under Testing above.

Tests will automatically:

- Check for GPU availability (and exit if it is not available)
- Validate audio file processing
- Test GPU health monitoring
- Test job queue operations
- Test the transcription pipeline

### Testing GPU Health

```python
# Test the GPU health check manually
from src.core.gpu_health import check_gpu_health

status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```

### Testing Job Queue

```python
# Test the job queue manually
from src.core.job_queue import JobQueue

queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()

# Submit a job
job_info = queue.submit_job(
    audio_path="/path/to/test.mp3",
    model_name="large-v3",
    device="cuda"
)
print(f"Job ID: {job_info['job_id']}")

# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")

# Get the result when completed
result = queue.get_job_result(job_info['job_id'])
```

### Common Debugging

**Model loading issues:**

- Check that `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check the logs for GPU driver test failures from model_manager.py

**GPU not detected:**

- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check that `torch.cuda.is_available()` returns True
- Run the GPU health check to see the detailed error

**Silent failures:**

- Confirm the service is NOT silently falling back to CPU
- The GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU

**Job queue issues:**

- Check that `JOB_METADATA_DIR` exists and is writable
- Verify the background worker thread is running (check the logs)
- Job metadata files live at `{JOB_METADATA_DIR}/{job_id}.json`

### File Locations

- **Logs**: `mcp.logs` (MCP server), `api.logs` (API server)
- **Models**: `$WHISPER_MODEL_DIR` or the HuggingFace cache
- **Outputs**: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- **Job metadata**: `$JOB_METADATA_DIR/{job_id}.json`

### Important Development Notes

- See `DEV_PLAN.md` for the detailed architecture and implementation plan for the async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time can exceed 30 seconds