Compare commits

...

3 Commits

Author | SHA1 | Message | Date
Alihan | d47c2843c3 | fix gpu check at startup issue | 2025-10-12 03:09:04 +03:00
Alihan | 06b8bc1304 | update claude md | 2025-10-10 01:49:48 +03:00
Alihan | 66b36e71e8 | Update documentation and configuration | 2025-10-10 01:22:41 +03:00
- Update CLAUDE.md with new test suite documentation
- Add PYTHONPATH instructions for direct execution
- Document new utility modules (startup, circuit_breaker, input_validation)
- Remove passwordless sudo section from GPU auto-reset docs
- Reduce job queue max size to 5 in API server config
- Rename supervisor program to transcriptor-api
- Remove log files from repository
9 changed files with 283 additions and 645 deletions

.gitignore vendored (3 changes)

@@ -17,3 +17,6 @@ venv/
logs/**
User/**
data/**
models/*
outputs/*
api.logs

CLAUDE.md (641 changes)

@@ -2,528 +2,213 @@
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
## Project Overview
This is a Whisper-based speech recognition service that provides high-performance audio transcription using Faster Whisper. The service runs as either:
**fast-whisper-mcp-server** is a high-performance audio transcription service built on faster-whisper with dual-server architecture:
- **MCP Server** (`whisper_server.py`): Model Context Protocol interface for LLM integration
- **REST API Server** (`api_server.py`): HTTP REST endpoints with FastAPI
1. **MCP Server** - For integration with Claude Desktop and other MCP clients
2. **REST API Server** - For HTTP-based integrations with async job queue support
The service features async job queue processing, GPU health monitoring with auto-reset, circuit breaker patterns, and comprehensive error handling. **GPU is required** - there is no CPU fallback.
Both servers share the same core transcription logic and can run independently or simultaneously on different ports.
## Core Commands
**Key Features:**
- Async job queue system for long-running transcriptions (prevents HTTP timeouts)
- GPU health monitoring with strict failure detection (prevents silent CPU fallback)
- **Automatic GPU driver reset** on CUDA errors with cooldown protection (handles sleep/wake issues)
- Dual-server architecture (MCP + REST API)
- Model caching for fast repeated transcriptions
- Automatic batch size optimization based on GPU memory
## Development Commands
### Environment Setup
### Running Servers
```bash
# Create and activate virtual environment
# MCP Server (for LLM integration via MCP)
./run_mcp_server.sh
# REST API Server (for HTTP clients)
./run_api_server.sh
# Both servers log to mcp.logs and api.logs respectively
```
### Testing
```bash
# Run core component tests (GPU health, job queue, validation)
python tests/test_core_components.py
# Run async API integration tests
python tests/test_async_api_integration.py
# Run end-to-end integration tests
python tests/test_e2e_integration.py
```
### GPU Management
```bash
# Reset GPU drivers without rebooting (requires sudo)
./reset_gpu.sh
# Check GPU status
nvidia-smi
# Monitor GPU during transcription
watch -n 1 nvidia-smi
```
### Installation
```bash
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install dependencies
# Install dependencies (check requirements.txt for CUDA-specific instructions)
pip install -r requirements.txt
# Install PyTorch with CUDA 12.6 support
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For CPU-only
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```
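A quick post-install sanity check (a minimal sketch, not part of the repository) confirms that the installed PyTorch build can actually see the GPU before starting either server:
```python
# verify_cuda.py - hypothetical helper, not included in the repo
import torch

if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available - the service will refuse to start in GPU-only mode")

device_index = torch.cuda.current_device()
free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
print(f"CUDA device {device_index}: {torch.cuda.get_device_name(device_index)}")
print(f"GPU memory free/total: {free_bytes / 1e9:.2f} / {total_bytes / 1e9:.2f} GB")
```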
### Running the Servers
#### MCP Server (for Claude Desktop)
```bash
# Using the startup script (recommended - sets all env vars)
./run_mcp_server.sh
# Direct Python execution
python whisper_server.py
# Using MCP CLI for development testing
mcp dev whisper_server.py
# Run server with MCP CLI
mcp run whisper_server.py
```
#### REST API Server (for HTTP clients)
```bash
# Using the startup script (recommended - sets all env vars)
./run_api_server.sh
# Direct Python execution with uvicorn
python api_server.py
# Or using uvicorn directly
uvicorn api_server:app --host 0.0.0.0 --port 8000
# Development mode with auto-reload
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000
```
#### Running Both Simultaneously
```bash
# Terminal 1: Start MCP server
./run_mcp_server.sh
# Terminal 2: Start REST API server
./run_api_server.sh
```
### Docker
```bash
# Build Docker image
docker build -t whisper-mcp-server .
# Run with GPU support
docker run --gpus all -v /path/to/models:/models -v /path/to/outputs:/outputs whisper-mcp-server
# For CUDA 12.4:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```
## Architecture
### Directory Structure
```
.
├── src/                            # Source code directory
│   ├── servers/                    # Server implementations
│   │   ├── whisper_server.py       # MCP server entry point
│   │   └── api_server.py           # REST API server (async job queue)
│   ├── core/                       # Core business logic
│   │   ├── transcriber.py          # Transcription logic (single & batch)
│   │   ├── model_manager.py        # Model lifecycle & caching
│   │   ├── job_queue.py            # Async job queue manager
│   │   └── gpu_health.py           # GPU health monitoring
│   └── utils/                      # Utility modules
│       ├── audio_processor.py      # Audio validation & preprocessing
│       ├── formatters.py           # Output format conversion
│       └── test_audio_generator.py # Test audio generation for GPU checks
├── run_mcp_server.sh               # MCP server startup script
├── run_api_server.sh               # API server startup script
├── reset_gpu.sh                    # GPU driver reset script
├── DEV_PLAN.md                     # Development plan for async features
├── requirements.txt                # Python dependencies
└── pyproject.toml                  # Project configuration
src/
├── core/                           # Core business logic
│   ├── transcriber.py              # Main transcription logic with env var defaults
│   ├── model_manager.py            # Whisper model loading/caching (GPU-only)
│   ├── job_queue.py                # Async FIFO job queue with worker thread
│   ├── gpu_health.py               # Real GPU health checks with circuit breaker
│   └── gpu_reset.py                # Automatic GPU driver reset logic
├── servers/                        # Server implementations
│   ├── whisper_server.py           # MCP server (stdio transport)
│   └── api_server.py               # FastAPI REST server
└── utils/                          # Utilities
    ├── startup.py                  # Common startup sequence (GPU check, initialization)
    ├── circuit_breaker.py          # Circuit breaker pattern implementation
    ├── audio_processor.py          # Audio file validation
    ├── formatters.py               # Output format handlers (txt, vtt, srt, json)
    ├── input_validation.py         # Input validation utilities
    └── test_audio_generator.py     # Generate test audio for health checks
```
### Core Components
### Key Architectural Patterns
1. **src/servers/whisper_server.py** - MCP server entry point
- Uses FastMCP framework to expose MCP tools
- Three main tools: `get_model_info_api()`, `transcribe()`, `batch_transcribe_audio()`
- Server initialization at line 19
**Async Job Queue** (`job_queue.py`):
- FIFO queue with background worker thread
- Disk persistence of job metadata to `JOB_METADATA_DIR`
- States: QUEUED → RUNNING → COMPLETED/FAILED (see the lifecycle sketch after this list)
- Jobs include full request params + results
- Thread-safe operations with locks
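The lifecycle described above can be pictured with a minimal sketch; the class and method names below are illustrative, not the actual API of `src/core/job_queue.py`:
```python
# Illustrative job lifecycle: thread-safe FIFO queue + on-disk metadata.
import json
import queue
import threading
import uuid
from enum import Enum
from pathlib import Path


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class TinyJobQueue:
    def __init__(self, metadata_dir: str, max_size: int = 5):
        self._queue = queue.Queue(maxsize=max_size)  # thread-safe FIFO
        self._jobs = {}                              # job_id -> metadata dict
        self._lock = threading.Lock()                # protects self._jobs
        self._metadata_dir = Path(metadata_dir)
        self._metadata_dir.mkdir(parents=True, exist_ok=True)

    def submit(self, audio_path: str) -> str:
        job_id = str(uuid.uuid4())
        meta = {"job_id": job_id, "status": JobStatus.QUEUED, "audio_path": audio_path}
        with self._lock:
            self._jobs[job_id] = meta
        self._persist(meta)              # metadata survives a crash/restart
        self._queue.put_nowait(job_id)   # raises queue.Full at max_size
        return job_id

    def _persist(self, meta: dict) -> None:
        (self._metadata_dir / f"{meta['job_id']}.json").write_text(json.dumps(meta))
```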
2. **src/servers/api_server.py** - REST API server entry point
- Uses FastAPI framework for HTTP endpoints
- Provides REST endpoints: `/`, `/health`, `/models`, `/transcribe`, `/batch-transcribe`, `/upload-transcribe`
- Shares core transcription logic with MCP server
- File upload support via multipart/form-data
**GPU Health Monitoring** (`gpu_health.py`):
- Performs **real** GPU checks: loads tiny model + transcribes test audio
- Circuit breaker prevents repeated failures (3 failures → open, 60s timeout)
- Integration with GPU auto-reset on failures
- Background monitoring thread in `HealthMonitor` class
- Never falls back to CPU - raises RuntimeError if GPU unavailable
3. **src/core/transcriber.py** - Core transcription logic (shared by both servers)
- `transcribe_audio()`:39 - Single file transcription with environment variable support
- `batch_transcribe()`:209 - Batch processing with progress reporting
- All parameters support environment variable defaults (lines 21-37)
- Delegates output formatting to utils.formatters
**GPU Auto-Reset** (`gpu_reset.py`):
- Automatically resets GPU drivers via `reset_gpu.sh` when health checks fail
- Cooldown mechanism (default 5 min via `GPU_RESET_COOLDOWN_MINUTES`)
- Sudo required - script unloads/reloads nvidia kernel modules
- Integrated with circuit breaker to avoid reset loops
4. **src/core/model_manager.py** - Whisper model lifecycle management
- `get_whisper_model()`:44 - Returns cached model instances or loads new ones
- `test_gpu_driver()`:20 - GPU validation before model loading
- **CRITICAL**: GPU-only mode enforced at lines 64-90 (no CPU fallback)
- Global `model_instances` dict caches loaded models to prevent reloading
- Automatic batch size optimization based on GPU memory (lines 134-147)
**Startup Sequence** (`startup.py`):
- Common startup logic for both servers
- Phase 1: GPU health check with optional auto-reset
- Phase 2: Initialize job queue
- Phase 3: Initialize health monitor (background thread)
- Exits on GPU failure unless configured otherwise
5. **src/core/job_queue.py** - Async job queue manager
- `JobQueue` class manages FIFO queue with background worker thread
- `submit_job()` - Validates audio, checks GPU health, adds to queue
- `get_job_status()` - Returns current job status and queue position
- `get_job_result()` - Returns transcription result for completed jobs
- Jobs persist to disk as JSON files for crash recovery
- Single worker thread processes jobs sequentially (prevents GPU contention)
**Circuit Breaker** (`circuit_breaker.py`):
- States: CLOSED → OPEN → HALF_OPEN → CLOSED
- Configurable failure/success thresholds
- Prevents cascading failures and resource exhaustion
- Used for GPU health checks and model operations (a minimal sketch follows this list)
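A minimal sketch of the state machine above (illustrative only; the actual implementation is `src/utils/circuit_breaker.py` and its thresholds and names may differ):
```python
# Illustrative circuit breaker: CLOSED -> OPEN after repeated failures,
# OPEN -> HALF_OPEN after a timeout, HALF_OPEN -> CLOSED on success.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, success_threshold=1, open_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout = open_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.open_timeout:
                raise RuntimeError("Circuit open - refusing call")
            self.state = "HALF_OPEN"  # timeout elapsed: allow a trial call
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"   # trip the breaker
                self.opened_at = time.monotonic()
                self.failures = 0
            raise
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"  # recovered
                self.failures = 0
        else:
            self.failures = 0
        return result
```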
6. **src/core/gpu_health.py** - GPU health monitoring
- `check_gpu_health()`:39 - Real GPU test using tiny model + test audio
- `GPUHealthStatus` dataclass contains detailed GPU metrics
- **CRITICAL**: Raises RuntimeError if device=cuda but GPU fails (lines 99-135)
- Prevents silent CPU fallback that would cause 10-100x slowdown
- `HealthMonitor` class for periodic background monitoring
### Environment Variables
7. **src/utils/audio_processor.py** - Audio file validation and preprocessing
- `validate_audio_file()`:15 - Checks file existence, format, and size
- `process_audio()`:50 - Decodes audio using faster_whisper's decode_audio
Both server scripts set extensive environment variables. Key ones:
8. **src/utils/formatters.py** - Output format conversion
- `format_vtt()`, `format_srt()`, `format_txt()`, `format_json()` - Convert segments to various formats
- All formatters accept segment lists from Whisper output
**GPU/CUDA**:
- `CUDA_VISIBLE_DEVICES`: GPU index (default: 1)
- `LD_LIBRARY_PATH`: CUDA library path
- `TRANSCRIPTION_DEVICE`: "cuda" or "auto" (never "cpu")
- `TRANSCRIPTION_COMPUTE_TYPE`: "float16", "int8", or "auto"
9. **src/utils/test_audio_generator.py** - Test audio generation
- `generate_test_audio()` - Creates synthetic 1-second audio for GPU health checks
- Uses numpy to generate sine wave, no external audio files needed
**Paths**:
- `WHISPER_MODEL_DIR`: Where Whisper models are cached
- `TRANSCRIPTION_OUTPUT_DIR`: Transcription output directory
- `TRANSCRIPTION_BATCH_OUTPUT_DIR`: Batch output directory
- `JOB_METADATA_DIR`: Job metadata persistence directory
### Key Architecture Patterns
**Transcription Defaults**:
- `TRANSCRIPTION_MODEL`: Model name (default: "large-v3")
- `TRANSCRIPTION_OUTPUT_FORMAT`: "txt", "vtt", "srt", or "json"
- `TRANSCRIPTION_BEAM_SIZE`: Beam search size (default: 5 for API, 2 for MCP)
- `TRANSCRIPTION_TEMPERATURE`: Sampling temperature (default: 0.0)
- **Dual Server Architecture**: Both MCP and REST API servers import and use the same core modules (core.transcriber, core.model_manager, utils.audio_processor, utils.formatters), ensuring consistent behavior
- **Model Caching**: Models are cached in `model_instances` dictionary with key format `{model_name}_{device}_{compute_type}` (src/core/model_manager.py:104). This cache is shared if both servers run in the same process
- **Batch Processing**: CUDA devices automatically use BatchedInferencePipeline for performance (src/core/model_manager.py:132-160)
- **Environment Variable Configuration**: All transcription parameters support env var defaults (src/core/transcriber.py:21-37)
- **GPU-Only Mode**: Service is configured for GPU-only operation. `device="auto"` requires CUDA, `device="cpu"` is rejected (src/core/model_manager.py:64-90)
- **Async Job Queue**: Long-running transcriptions use async queue pattern to prevent HTTP timeouts. Jobs return immediately with job_id for polling
- **GPU Health Monitoring**: Real GPU tests with tiny model prevent silent CPU fallback. Jobs are rejected immediately if GPU fails rather than running 10-100x slower on CPU
**Job Queue**:
- `JOB_QUEUE_MAX_SIZE`: Max queued jobs (default: 100 for MCP, 5 for API)
- `JOB_RETENTION_DAYS`: How long to keep job metadata (default: 7)
## Environment Variables
**Health Monitoring**:
- `GPU_HEALTH_CHECK_ENABLED`: Enable background monitoring (default: true)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES`: Check interval (default: 10)
- `GPU_HEALTH_TEST_MODEL`: Model for health checks (default: "tiny")
- `GPU_RESET_COOLDOWN_MINUTES`: Cooldown between reset attempts (default: 5)
All configuration can be set via environment variables in run_mcp_server.sh and run_api_server.sh:
### API Workflow (Async Jobs)
**API Server Specific:**
- `API_HOST` - API server host (default: 0.0.0.0)
- `API_PORT` - API server port (default: 8000)
Both MCP and REST API use the same async workflow:
**Job Queue Configuration (if using async features):**
- `JOB_QUEUE_MAX_SIZE` - Maximum queue size (default: 100)
- `JOB_METADATA_DIR` - Directory for job metadata JSON files
- `JOB_RETENTION_DAYS` - Auto-cleanup old jobs (0=disabled)
1. **Submit job**: `transcribe_async()` returns `job_id` immediately
2. **Poll status**: `get_job_status(job_id)` returns status + queue_position
3. **Get result**: When status="completed", `get_job_result(job_id)` returns transcription
**GPU Health Monitoring:**
- `GPU_HEALTH_CHECK_ENABLED` - Enable periodic GPU monitoring (true/false)
- `GPU_HEALTH_CHECK_INTERVAL_MINUTES` - Monitoring interval (default: 10)
- `GPU_HEALTH_TEST_MODEL` - Model for health checks (default: tiny)
The job queue processes one job at a time in a background worker thread.
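A hedged end-to-end example of this workflow against the REST API, assuming the server is reachable at http://localhost:8000 and the `requests` package is installed:
```python
# Submit a job, poll until it finishes, then fetch the result.
# Endpoints match the REST API documented below; host/port are assumptions.
import time
import requests

BASE = "http://localhost:8000"

resp = requests.post(f"{BASE}/jobs", json={"audio_path": "/path/to/audio.mp3"})
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll every 5-10 seconds, as recommended for LLM agents
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}").json()["status"]
    if status in ("completed", "failed"):
        break
    time.sleep(5)

if status == "completed":
    print(requests.get(f"{BASE}/jobs/{job_id}/result").text)
```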
**GPU Auto-Reset Configuration:**
- `GPU_RESET_COOLDOWN_MINUTES` - Minimum time between GPU reset attempts (default: 5 minutes)
- Prevents reset loops while allowing recovery from sleep/wake cycles
- Auto-reset is **enabled by default**
- Service terminates if GPU unavailable after reset attempt
### Model Loading Strategy
**Transcription Configuration (shared by both servers):**
- `CUDA_VISIBLE_DEVICES` - GPU device selection
- `WHISPER_MODEL_DIR` - Model storage location (defaults to None for HuggingFace cache)
- `TRANSCRIPTION_OUTPUT_DIR` - Default output directory for single transcriptions
- `TRANSCRIPTION_BATCH_OUTPUT_DIR` - Default output directory for batch processing
- `TRANSCRIPTION_MODEL` - Model size (tiny, base, small, medium, large-v1, large-v2, large-v3)
- `TRANSCRIPTION_DEVICE` - Execution device (cuda, auto) - **NOTE: cpu is rejected in GPU-only mode**
- `TRANSCRIPTION_COMPUTE_TYPE` - Computation type (float16, int8, auto)
- `TRANSCRIPTION_OUTPUT_FORMAT` - Output format (vtt, srt, txt, json)
- `TRANSCRIPTION_BEAM_SIZE` - Beam search size (default: 5)
- `TRANSCRIPTION_TEMPERATURE` - Sampling temperature (default: 0.0)
- `TRANSCRIPTION_USE_TIMESTAMP` - Add timestamp to filenames (true/false)
- `TRANSCRIPTION_FILENAME_PREFIX` - Prefix for output filenames
- `TRANSCRIPTION_FILENAME_SUFFIX` - Suffix for output filenames
- `TRANSCRIPTION_LANGUAGE` - Language code (zh, en, ja, etc., auto-detect if not set)
## Supported Configurations
- **Models**: tiny, base, small, medium, large-v1, large-v2, large-v3
- **Audio formats**: .mp3, .wav, .m4a, .flac, .ogg, .aac
- **Output formats**: vtt, srt, json, txt
- **Languages**: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), ru (Russian), it (Italian), pt (Portuguese), nl (Dutch), ar (Arabic), hi (Hindi), tr (Turkish), vi (Vietnamese), th (Thai), id (Indonesian)
## REST API Endpoints
The REST API server provides the following HTTP endpoints:
### GET /
Returns API information and available endpoints.
### GET /health
Health check endpoint. Returns `{"status": "healthy", "service": "whisper-transcription"}`.
### GET /models
Returns available Whisper models, devices, languages, and system information (GPU details if CUDA available).
### POST /transcribe
Transcribe a single audio file that exists on the server.
**Request Body:**
```json
{
"audio_path": "/path/to/audio.mp3",
"model_name": "large-v3",
"device": "auto",
"compute_type": "auto",
"language": "en",
"output_format": "txt",
"beam_size": 5,
"temperature": 0.0,
"initial_prompt": null,
"output_directory": null
}
```
**Response:**
```json
{
"success": true,
"message": "Transcription successful, results saved to: /path/to/output.txt",
"output_path": "/path/to/output.txt"
}
```
### POST /batch-transcribe
Batch transcribe all audio files in a folder.
**Request Body:**
```json
{
"audio_folder": "/path/to/audio/folder",
"output_folder": "/path/to/output",
"model_name": "large-v3",
"output_format": "txt",
...
}
```
**Response:**
```json
{
"success": true,
"summary": "Batch processing completed, total transcription time: 00:05:23 | Success: 10/10 | Failed: 0/10"
}
```
### POST /upload-transcribe
Upload an audio file and transcribe it immediately. Returns the transcription file as a download.
**Form Data:**
- `file`: Audio file (multipart/form-data)
- `model_name`: Model name (default: "large-v3")
- `device`: Device (default: "auto")
- `output_format`: Output format (default: "txt")
- ... (other transcription parameters)
**Response:** Returns the transcription file for download.
### API Usage Examples
```bash
# Get model information
curl http://localhost:8000/models
# Transcribe existing file (synchronous)
curl -X POST http://localhost:8000/transcribe \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3", "output_format": "txt"}'
# Upload and transcribe
curl -X POST http://localhost:8000/upload-transcribe \
-F "file=@audio.mp3" \
-F "output_format=txt" \
-F "model_name=large-v3"
# Async job queue (if enabled)
# Submit job
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{"audio_path": "/path/to/audio.mp3"}'
# Returns: {"job_id": "abc-123", "status": "queued", "queue_position": 1}
# Check status
curl http://localhost:8000/jobs/abc-123
# Returns: {"status": "running", ...}
# Get result (when completed)
curl http://localhost:8000/jobs/abc-123/result
# Returns: transcription text
# Check GPU health
curl http://localhost:8000/health/gpu
# Returns: {"gpu_available": true, "gpu_working": true, ...}
```
## GPU Auto-Reset Configuration
### Overview
This service features automatic GPU driver reset on CUDA errors, which is especially useful for recovering from sleep/wake cycles. The reset functionality is **enabled by default** and includes cooldown protection to prevent reset loops.
### Passwordless Sudo Setup (Required)
For automatic GPU reset to work, you must configure passwordless sudo for NVIDIA commands. Create a sudoers configuration file:
```bash
sudo visudo -f /etc/sudoers.d/whisper-gpu-reset
```
Add the following (replace `your_username` with your actual username):
```
# Whisper GPU Auto-Reset Permissions
your_username ALL=(ALL) NOPASSWD: /bin/systemctl stop nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /bin/systemctl start nvidia-persistenced
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_drm
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/rmmod nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_modeset
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_uvm
your_username ALL=(ALL) NOPASSWD: /sbin/modprobe nvidia_drm
```
**Security Note:** These permissions are limited to specific NVIDIA driver commands only. The reset script (`reset_gpu.sh`) is executed with sudo but is part of the codebase and can be audited.
### How It Works
1. **Startup Check**: When the service starts, it performs a GPU health check
- If CUDA errors detected → automatic reset attempt → retry
- If retry fails → service terminates
2. **Runtime Check**: Before job submission and model loading
- If CUDA errors detected → automatic reset attempt → retry
- If retry fails → job rejected, service continues
3. **Cooldown Protection**: Prevents reset loops
- Minimum 5 minutes between reset attempts (configurable via `GPU_RESET_COOLDOWN_MINUTES`)
- Cooldown persists across restarts (stored in `/tmp/whisper-gpu-last-reset`)
- If reset needed but cooldown active → service/job fails immediately (see the cooldown sketch below)
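A minimal sketch of a cooldown check like the one described above, assuming the marker file simply stores an epoch timestamp (the real format used by `src/core/gpu_reset.py` may differ):
```python
# Hypothetical cooldown check; the actual logic lives in src/core/gpu_reset.py.
import os
import time

COOLDOWN_FILE = "/tmp/whisper-gpu-last-reset"
COOLDOWN_SECONDS = int(os.getenv("GPU_RESET_COOLDOWN_MINUTES", "5")) * 60


def reset_allowed() -> bool:
    """Return True if enough time has passed since the last reset attempt."""
    try:
        with open(COOLDOWN_FILE) as f:
            last_reset = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True  # no previous reset recorded
    return (time.time() - last_reset) >= COOLDOWN_SECONDS
```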
### Manual GPU Reset
You can manually reset the GPU anytime:
```bash
./reset_gpu.sh
```
Or clear the cooldown to allow immediate reset:
```python
from core.gpu_reset import clear_reset_cooldown
clear_reset_cooldown()
```
### Behavior Examples
**After sleep/wake with GPU issue:**
```
Service starts → GPU check fails (CUDA error)
→ Cooldown OK → Reset drivers → Wait 3s → Retry
→ Success → Service continues
```
**Multiple failures (hardware issue):**
```
First failure → Reset → Retry fails → Job fails
Second failure within 5 min → Cooldown active → Fail immediately
(Prevents reset loop)
```
**Normal operation:**
```
No CUDA errors → No resets → Normal performance
Reset only happens on actual CUDA failures
```
- Models are cached in `model_instances` dict (key: model_name + device + compute_type)
- First load downloads model to `WHISPER_MODEL_DIR` (or default cache)
- GPU health check on model load - may trigger auto-reset if GPU fails
- No CPU fallback - raises `RuntimeError` if CUDA unavailable
## Important Implementation Details
### GPU-Only Architecture
- **CRITICAL**: Service enforces GPU-only mode. CPU device is explicitly rejected (src/core/model_manager.py:84-90)
- `device="auto"` requires CUDA to be available, raises RuntimeError if not (src/core/model_manager.py:64-73)
- GPU health checks use real model loading + transcription, not just torch.cuda.is_available()
- If GPU health check fails, jobs are rejected immediately rather than silently falling back to CPU
- **GPU Auto-Reset**: Automatic driver reset on CUDA errors with 5-minute cooldown (handles sleep/wake issues)
**GPU-Only Architecture**:
- All `device="auto"` resolution checks `torch.cuda.is_available()` and raises error if False
- No silent fallback to CPU anywhere in the codebase
- Health checks verify model actually ran on GPU (check `torch.cuda.memory_allocated`; see the sketch below)
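A minimal sketch of these two guards using standard PyTorch calls (illustrative; the real checks live in `src/core/gpu_health.py` and `src/core/model_manager.py`):
```python
# Illustrative GPU-only guards, not the repo's actual functions.
import torch


def resolve_device(requested: str = "auto") -> str:
    """Resolve the device without ever falling back to CPU."""
    if requested == "cpu":
        raise RuntimeError("CPU device is rejected - this service is GPU-only")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA unavailable - refusing silent CPU fallback")
    return "cuda"


def gpu_memory_was_touched() -> bool:
    """Rough proxy that a model actually ran on the GPU."""
    return torch.cuda.memory_allocated() > 0
```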
### Model Management
- GPU memory is checked before loading models (src/core/model_manager.py:115-127)
- Batch size dynamically adjusts: 32 (>16GB), 16 (>12GB), 8 (>8GB), 4 (>4GB), 2 (otherwise); see the sketch after this list
- Models are cached globally in `model_instances` dict, shared across requests
- Model loading includes GPU driver test to fail fast if GPU is unavailable (src/core/model_manager.py:112-114)
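The threshold table above maps to roughly the following selection logic (a sketch of the documented thresholds, not the exact code in `model_manager.py`):
```python
def pick_batch_size(free_gpu_memory_gb: float) -> int:
    """Map available GPU memory to the documented batch-size tiers."""
    if free_gpu_memory_gb > 16:
        return 32
    if free_gpu_memory_gb > 12:
        return 16
    if free_gpu_memory_gb > 8:
        return 8
    if free_gpu_memory_gb > 4:
        return 4
    return 2
```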
**Thread Safety**:
- `JobQueue` uses `threading.Lock` for job dictionary access
- Worker thread processes jobs from `queue.Queue` (thread-safe FIFO)
- `HealthMonitor` runs in separate daemon thread
### Transcription Settings
- VAD (Voice Activity Detection) is enabled by default for better long-audio accuracy (src/core/transcriber.py:102)
- Word timestamps are enabled by default (src/core/transcriber.py:107)
- Files over 1GB generate warnings about processing time (src/utils/audio_processor.py:42)
- Default output format is "txt" for REST API, configured via environment variables for MCP server
**Error Handling**:
- Circuit breaker prevents retry storms on GPU failures
- Input validation rejects invalid audio files, model names, languages
- Job errors are captured and stored in job metadata with status=FAILED
### Async Job Queue (if enabled)
- Single worker thread processes jobs sequentially (prevents GPU memory contention)
- Jobs persist to disk as JSON files in JOB_METADATA_DIR
- Queue has max size limit (default 100), returns 503 when full
- Job status polling recommended every 5-10 seconds for LLM agents
**Shutdown Handling**:
- `cleanup_on_shutdown()` waits for current job to complete
- Stops health monitor thread
- Saves final job states to disk
## Development Workflow
## Common Development Tasks
### Testing GPU Health
```python
# Test GPU health check manually
from src.core.gpu_health import check_gpu_health
**Adding a new output format**:
1. Add formatter function in `src/utils/formatters.py`
2. Add case in `transcribe_audio()` in `src/core/transcriber.py`
3. Update API docs and MCP tool descriptions
status = check_gpu_health(expected_device="cuda")
print(f"GPU Working: {status.gpu_working}")
print(f"Device: {status.device_used}")
print(f"Test Duration: {status.test_duration_seconds}s")
# Expected: <1s for GPU, 3-10s for CPU
```
**Adjusting GPU health check behavior**:
1. Modify circuit breaker params in `src/core/gpu_health.py`
2. Adjust health check interval in environment variables
3. Consider cooldown timing in `src/core/gpu_reset.py`
### Testing Job Queue
```python
# Test job queue manually
from src.core.job_queue import JobQueue
**Testing GPU reset logic**:
1. Manually trigger GPU failure (e.g., occupy all GPU memory)
2. Watch logs for circuit breaker state transitions
3. Verify reset attempt with cooldown enforcement
4. Check `nvidia-smi` before/after reset
queue = JobQueue(max_queue_size=100, metadata_dir="/tmp/jobs")
queue.start()
# Submit job
job_info = queue.submit_job(
audio_path="/path/to/test.mp3",
model_name="large-v3",
device="cuda"
)
print(f"Job ID: {job_info['job_id']}")
# Poll status
status = queue.get_job_status(job_info['job_id'])
print(f"Status: {status['status']}")
# Get result when completed
result = queue.get_job_result(job_info['job_id'])
```
### Common Debugging
**Model loading issues:**
- Check `WHISPER_MODEL_DIR` is set correctly
- Verify GPU memory with `nvidia-smi`
- Check logs for GPU driver test failures at model_manager.py:112-114
**GPU not detected:**
- Verify `CUDA_VISIBLE_DEVICES` is set correctly
- Check `torch.cuda.is_available()` returns True
- Run GPU health check to see detailed error
**Silent failures:**
- Check that service is NOT silently falling back to CPU
- GPU health check should RAISE errors, not log warnings
- If device=cuda fails, the job should be rejected, not processed on CPU
**Job queue issues:**
- Check `JOB_METADATA_DIR` exists and is writable
- Verify background worker thread is running (check logs)
- Job metadata files are in {JOB_METADATA_DIR}/{job_id}.json
### File Locations
- **Logs**: `mcp.logs` (MCP server), `api.logs` (API server)
- **Models**: `$WHISPER_MODEL_DIR` or HuggingFace cache
- **Outputs**: `$TRANSCRIPTION_OUTPUT_DIR` or `$TRANSCRIPTION_BATCH_OUTPUT_DIR`
- **Job Metadata**: `$JOB_METADATA_DIR/{job_id}.json`
### Important Development Notes
- See `DEV_PLAN.md` for detailed architecture and implementation plan for async job queue features
- The service is designed for GPU-only operation - CPU fallback is intentionally disabled to prevent silent performance degradation
- When modifying model_manager.py, maintain the strict GPU-only enforcement
- When adding new endpoints, follow the async pattern if transcription time >30 seconds
**Debugging job queue issues**:
1. Check job metadata files in `JOB_METADATA_DIR` (see the inspection sketch after this list)
2. Look for lock contention in logs
3. Verify worker thread is running (check logs for "Job queue worker started")
4. Test with `JOB_QUEUE_MAX_SIZE=1` to isolate serialization
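Since jobs are persisted as `{job_id}.json` files, a small script like this can audit their statuses (a sketch assuming `JOB_METADATA_DIR` is set and the metadata contains `job_id` and `status` fields):
```python
# List persisted jobs and their statuses; exact field names may differ.
import json
import os
from pathlib import Path

metadata_dir = Path(os.environ["JOB_METADATA_DIR"])

for path in sorted(metadata_dir.glob("*.json")):
    meta = json.loads(path.read_text())
    print(f"{meta.get('job_id', path.stem)}: {meta.get('status', 'unknown')}")
```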


@@ -1,85 +0,0 @@
INFO:__main__:======================================================================
INFO:__main__:PERFORMING STARTUP GPU HEALTH CHECK
INFO:__main__:======================================================================
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 1.04s
INFO:__main__:======================================================================
INFO:__main__:STARTUP GPU CHECK SUCCESSFUL
INFO:__main__:GPU Device: NVIDIA GeForce RTX 3060
INFO:__main__:Memory Available: 11.66 GB
INFO:__main__:Test Duration: 1.04s
INFO:__main__:======================================================================
INFO:__main__:Starting Whisper REST API server on 0.0.0.0:8000
INFO: Started server process [69821]
INFO: Waiting for application startup.
INFO:__main__:Starting job queue and health monitor...
INFO:core.job_queue:Starting job queue (max size: 100)
INFO:core.job_queue:Loading jobs from /media/raid/agents/tools/mcp-transcriptor/outputs/jobs
INFO:core.job_queue:Loaded 8 jobs from disk
INFO:core.job_queue:Job queue worker loop started
INFO:core.job_queue:Job queue worker started
INFO:__main__:Job queue started (max_size=100, metadata_dir=/media/raid/agents/tools/mcp-transcriptor/outputs/jobs)
INFO:core.gpu_health:Starting GPU health monitor (interval: 10.0 minutes)
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.37s
INFO:__main__:GPU health monitor started (interval=10 minutes)
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:48092 - "GET /jobs HTTP/1.1" 200 OK
INFO: 127.0.0.1:60874 - "GET /jobs?status=completed&limit=3 HTTP/1.1" 200 OK
INFO: 127.0.0.1:60876 - "GET /jobs?status=failed&limit=10 HTTP/1.1" 200 OK
INFO:core.job_queue:Running GPU health check before job submission
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.39s
INFO:core.job_queue:GPU health check passed
INFO:core.job_queue:Job 6be8e49a-bdc1-4508-af99-280bef033cb0 submitted: /tmp/whisper_test_voice_1s.mp3 (queue position: 1)
INFO: 127.0.0.1:58376 - "POST /jobs HTTP/1.1" 200 OK
INFO:core.job_queue:Job 6be8e49a-bdc1-4508-af99-280bef033cb0 started processing
INFO:core.model_manager:Running GPU health check with auto-reset before model loading
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.54s
INFO:core.model_manager:Loading Whisper model: tiny device: cuda compute type: float16
INFO:core.model_manager:Available GPU memory: 12.52 GB
INFO:core.model_manager:Enabling batch processing acceleration, batch size: 16
INFO:core.transcriber:Starting transcription of file: whisper_test_voice_1s.mp3
INFO:utils.audio_processor:Successfully preprocessed audio: whisper_test_voice_1s.mp3
INFO:core.transcriber:Using batch acceleration for transcription...
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:VAD filter removed 00:00.000 of audio
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.transcriber:Transcription completed, time used: 0.16 seconds, detected language: en, audio length: 1.51 seconds
INFO:core.transcriber:Transcription results saved to: /media/raid/agents/tools/mcp-transcriptor/outputs/whisper_test_voice_1s.txt
INFO:core.job_queue:Job 6be8e49a-bdc1-4508-af99-280bef033cb0 completed successfully: /media/raid/agents/tools/mcp-transcriptor/outputs/whisper_test_voice_1s.txt
INFO:core.job_queue:Job 6be8e49a-bdc1-4508-af99-280bef033cb0 finished: status=completed, duration=1.1s
INFO: 127.0.0.1:41646 - "GET /jobs/6be8e49a-bdc1-4508-af99-280bef033cb0 HTTP/1.1" 200 OK
INFO: 127.0.0.1:34046 - "GET /jobs/6be8e49a-bdc1-4508-af99-280bef033cb0/result HTTP/1.1" 200 OK
INFO:core.job_queue:Running GPU health check before job submission
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.39s
INFO:core.job_queue:GPU health check passed
INFO:core.job_queue:Job 41ce74c0-8929-457b-96b3-1b8e4a720a7a submitted: /home/uad/agents/tools/mcp-transcriptor/data/test.mp3 (queue position: 1)
INFO: 127.0.0.1:44576 - "POST /jobs HTTP/1.1" 200 OK
INFO:core.job_queue:Job 41ce74c0-8929-457b-96b3-1b8e4a720a7a started processing
INFO:core.model_manager:Running GPU health check with auto-reset before model loading
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.39s
INFO:core.model_manager:Loading Whisper model: large-v3 device: cuda compute type: float16
INFO:core.model_manager:Available GPU memory: 12.52 GB
INFO:core.model_manager:Enabling batch processing acceleration, batch size: 16
INFO:core.transcriber:Starting transcription of file: test.mp3
INFO:utils.audio_processor:Successfully preprocessed audio: test.mp3
INFO:core.transcriber:Using batch acceleration for transcription...
INFO:faster_whisper:Processing audio with duration 00:06.955
INFO:faster_whisper:VAD filter removed 00:00.299 of audio
INFO:core.transcriber:Transcription completed, time used: 0.52 seconds, detected language: en, audio length: 6.95 seconds
INFO:core.transcriber:Transcription results saved to: /media/raid/agents/tools/mcp-transcriptor/outputs/test.txt
INFO:core.job_queue:Job 41ce74c0-8929-457b-96b3-1b8e4a720a7a completed successfully: /media/raid/agents/tools/mcp-transcriptor/outputs/test.txt
INFO:core.job_queue:Job 41ce74c0-8929-457b-96b3-1b8e4a720a7a finished: status=completed, duration=23.3s
INFO: 127.0.0.1:59120 - "GET /jobs/41ce74c0-8929-457b-96b3-1b8e4a720a7a HTTP/1.1" 200 OK
INFO: 127.0.0.1:53806 - "GET /jobs/41ce74c0-8929-457b-96b3-1b8e4a720a7a/result HTTP/1.1" 200 OK


@@ -1,25 +0,0 @@
starting mcp server for whisper stt transcriptor
INFO:__main__:======================================================================
INFO:__main__:PERFORMING STARTUP GPU HEALTH CHECK
INFO:__main__:======================================================================
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.93s
INFO:__main__:======================================================================
INFO:__main__:STARTUP GPU CHECK SUCCESSFUL
INFO:__main__:GPU Device: NVIDIA GeForce RTX 3060
INFO:__main__:Memory Available: 11.66 GB
INFO:__main__:Test Duration: 0.93s
INFO:__main__:======================================================================
INFO:__main__:Initializing job queue...
INFO:core.job_queue:Starting job queue (max size: 100)
INFO:core.job_queue:Loading jobs from /media/raid/agents/tools/mcp-transcriptor/outputs/jobs
INFO:core.job_queue:Loaded 5 jobs from disk
INFO:core.job_queue:Job queue worker loop started
INFO:core.job_queue:Job queue worker started
INFO:__main__:Job queue started (max_size=100, metadata_dir=/media/raid/agents/tools/mcp-transcriptor/outputs/jobs)
INFO:core.gpu_health:Starting GPU health monitor (interval: 10.0 minutes)
INFO:faster_whisper:Processing audio with duration 00:01.512
INFO:faster_whisper:Detected language 'en' with probability 0.95
INFO:core.gpu_health:GPU health check passed: NVIDIA GeForce RTX 3060, test duration: 0.38s
INFO:__main__:GPU health monitor started (interval=10 minutes)


@@ -11,6 +11,10 @@ export PYTHONPATH="/home/uad/agents/tools/mcp-transcriptor/src:$PYTHONPATH"
# Set CUDA library path
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
# Set proxy for model downloads
export HTTP_PROXY=http://192.168.1.212:8080
export HTTPS_PROXY=http://192.168.1.212:8080
# Set environment variables
export CUDA_VISIBLE_DEVICES=1
export WHISPER_MODEL_DIR="/home/uad/agents/tools/mcp-transcriptor/data/models"
@@ -27,13 +31,13 @@ export TRANSCRIPTION_FILENAME_PREFIX=""
# API server configuration
export API_HOST="0.0.0.0"
export API_PORT="8000"
export API_PORT="33767"
# GPU Auto-Reset Configuration
export GPU_RESET_COOLDOWN_MINUTES=5 # Minimum time between GPU reset attempts
# Job Queue Configuration
export JOB_QUEUE_MAX_SIZE=100
export JOB_QUEUE_MAX_SIZE=5
export JOB_METADATA_DIR="/media/raid/agents/tools/mcp-transcriptor/outputs/jobs"
export JOB_RETENTION_DAYS=7


@@ -15,6 +15,10 @@ export PYTHONPATH="/home/uad/agents/tools/mcp-transcriptor/src:$PYTHONPATH"
# Set CUDA library path
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
# Set proxy for model downloads
export HTTP_PROXY=http://192.168.1.212:8080
export HTTPS_PROXY=http://192.168.1.212:8080
# Set environment variables
export CUDA_VISIBLE_DEVICES=1
export WHISPER_MODEL_DIR="/home/uad/agents/tools/mcp-transcriptor/data/models"


@@ -93,6 +93,7 @@ async def root():
"GET /health/circuit-breaker": "Get circuit breaker stats",
"POST /health/circuit-breaker/reset": "Reset circuit breaker",
"GET /models": "Get available models information",
"POST /transcribe": "Upload audio file and submit transcription job",
"POST /jobs": "Submit transcription job (async)",
"GET /jobs/{job_id}": "Get job status",
"GET /jobs/{job_id}/result": "Get job result",
@@ -123,6 +124,92 @@ async def get_models():
        raise HTTPException(status_code=500, detail=f"Failed to get model info: {str(e)}")


@app.post("/transcribe")
async def transcribe_upload(
    file: UploadFile = File(...),
    model: str = Form("medium"),
    language: Optional[str] = Form(None),
    output_format: str = Form("txt"),
    beam_size: int = Form(5),
    temperature: float = Form(0.0),
    initial_prompt: Optional[str] = Form(None)
):
    """
    Upload audio file and submit transcription job in one request.

    Returns immediately with job_id. Poll GET /jobs/{job_id} for status.
    """
    temp_file_path = None
    try:
        # Save uploaded file to temp directory
        upload_dir = Path(os.getenv("TRANSCRIPTION_OUTPUT_DIR", "/tmp")) / "uploads"
        upload_dir.mkdir(parents=True, exist_ok=True)

        # Create temp file with original filename
        temp_file_path = upload_dir / file.filename
        logger.info(f"Receiving upload: {file.filename} ({file.content_type})")

        # Save uploaded file
        with open(temp_file_path, "wb") as f:
            content = await file.read()
            f.write(content)
        logger.info(f"Saved upload to: {temp_file_path}")

        # Submit transcription job
        job_info = job_queue.submit_job(
            audio_path=str(temp_file_path),
            model_name=model,
            device="auto",
            compute_type="auto",
            language=language,
            output_format=output_format,
            beam_size=beam_size,
            temperature=temperature,
            initial_prompt=initial_prompt,
            output_directory=None
        )

        return JSONResponse(
            status_code=200,
            content={
                **job_info,
                "message": f"File uploaded and job submitted. Poll /jobs/{job_info['job_id']} for status."
            }
        )

    except queue_module.Full:
        # Clean up temp file if queue is full
        if temp_file_path and temp_file_path.exists():
            temp_file_path.unlink()
        logger.warning("Job queue is full, rejecting upload")
        raise HTTPException(
            status_code=503,
            detail={
                "error": "Queue full",
                "message": "Job queue is full. Please try again later.",
                "queue_size": job_queue._max_queue_size,
                "max_queue_size": job_queue._max_queue_size
            }
        )

    except Exception as e:
        # Clean up temp file on error
        if temp_file_path and temp_file_path.exists():
            temp_file_path.unlink()
        logger.error(f"Failed to process upload: {e}")
        raise HTTPException(
            status_code=500,
            detail={
                "error": "Upload failed",
                "message": str(e)
            }
        )


@app.post("/jobs")
async def submit_job(request: SubmitJobRequest):
    """


@@ -1,7 +1,7 @@
"""
Test audio generator for GPU health checks.
Generates realistic test audio with speech using TTS (text-to-speech).
Returns path to existing test audio file - NO GENERATION, NO INTERNET.
"""
import os
@@ -10,70 +10,35 @@ import tempfile
def generate_test_audio(duration_seconds: float = 3.0, frequency: int = 440) -> str:
    """
    Generate a test audio file with real speech for GPU health checks.
    Return path to existing test audio file for GPU health checks.

    NO AUDIO GENERATION - just returns path to pre-existing test file.
    NO INTERNET CONNECTION REQUIRED.

    Args:
        duration_seconds: Duration of audio in seconds (default: 3.0)
        frequency: Legacy parameter, ignored (kept for backward compatibility)
        duration_seconds: Duration hint (default: 3.0) - used for cache lookup
        frequency: Legacy parameter, ignored

    Returns:
        str: Path to temporary audio file
        str: Path to test audio file

    Implementation:
        - Generate real speech using gTTS (Google Text-to-Speech)
        - Fallback to pyttsx3 if gTTS fails or is unavailable
        - Raises RuntimeError if both TTS engines fail
        - Save as MP3 format
        - Store in system temp directory
        - Reuse same file if exists (cache)

    Raises:
        RuntimeError: If test audio file doesn't exist
    """
    # Use a consistent filename in temp directory for caching
    # Check for existing test audio in temp directory
    temp_dir = tempfile.gettempdir()
    audio_path = os.path.join(temp_dir, f"whisper_test_voice_{int(duration_seconds)}s.mp3")

    # Return cached file if it exists and is valid
    if os.path.exists(audio_path):
        try:
            # Verify file is readable and not empty
            if os.path.getsize(audio_path) > 0:
                return audio_path
        except Exception:
            # If file is corrupted, regenerate it
            pass

    # Generate speech with different text based on duration
    if duration_seconds >= 3:
        text = "This is a test of the Whisper speech recognition system. Testing one, two, three."
    elif duration_seconds >= 2:
        text = "This is a test of the Whisper system."
    else:
        text = "Testing Whisper."

    # Try gTTS first (better quality, requires internet)
    try:
        from gtts import gTTS
        tts = gTTS(text=text, lang='en', slow=False)
        tts.save(audio_path)
        if os.path.exists(audio_path) and os.path.getsize(audio_path) > 0:
            return audio_path
    except Exception as e:
        print(f"gTTS failed ({e}), trying pyttsx3...")

    # Fallback to pyttsx3 (offline, lower quality)
    try:
        import pyttsx3
        engine = pyttsx3.init()
        engine.save_to_file(text, audio_path)
        engine.runAndWait()
        # Verify file was created
        if os.path.exists(audio_path) and os.path.getsize(audio_path) > 0:
            return audio_path
    except Exception as e:
        raise RuntimeError(
            f"Failed to generate test audio. Both gTTS and pyttsx3 failed. "
            f"gTTS error: {e}. Please ensure TTS dependencies are installed: "
            f"pip install gTTS pyttsx3"
        )

    # If no cached file, raise error - we don't generate anything
    raise RuntimeError(
        f"Test audio file not found: {audio_path}. "
        f"Please ensure test audio exists before running GPU health checks. "
        f"Expected file location: {audio_path}"
    )
def cleanup_test_audio() -> None:


@@ -1,4 +1,4 @@
[program:whisper-api-server]
[program:transcriptor-api]
command=/home/uad/agents/tools/mcp-transcriptor/venv/bin/python /home/uad/agents/tools/mcp-transcriptor/src/servers/api_server.py
directory=/home/uad/agents/tools/mcp-transcriptor
user=uad
@@ -12,7 +12,7 @@ environment=
PYTHONPATH="/home/uad/agents/tools/mcp-transcriptor/src",
CUDA_VISIBLE_DEVICES="0",
API_HOST="0.0.0.0",
API_PORT="8000",
API_PORT="33767",
WHISPER_MODEL_DIR="/home/uad/agents/tools/mcp-transcriptor/models",
TRANSCRIPTION_OUTPUT_DIR="/home/uad/agents/tools/mcp-transcriptor/outputs",
TRANSCRIPTION_BATCH_OUTPUT_DIR="/home/uad/agents/tools/mcp-transcriptor/outputs/batch",