Files
Fast-Whisper-MCP-Server/TRANSCRIPTOR_API_FIX.md
Alihan fb1e5dceba Upgrade to PyTorch 2.6.0 and enhance GPU reset script with Ollama management
- Upgrade PyTorch and torchaudio to 2.6.0 with CUDA 12.4 support
- Update GPU reset script to gracefully stop/start Ollama via supervisorctl
- Add Docker Compose configuration for both API and MCP server modes
- Implement comprehensive Docker entrypoint for multi-mode deployment
- Add GPU health check cleanup to prevent memory leaks
- Fix transcription memory management with proper resource cleanup
- Add filename security validation to prevent path traversal attacks
- Include .dockerignore for optimized Docker builds
- Remove deprecated supervisor configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-27 23:01:22 +03:00

11 KiB

Transcriptor API - Filename Validation Bug Fix

Issue Summary

The transcriptor API is rejecting valid audio files due to overly strict path validation. Files with .. (double periods) anywhere in the filename are being rejected as potential path traversal attacks, even when they appear naturally in legitimate filenames.

Current Behavior

Error Observed

{
  "detail": {
    "error": "Upload failed",
    "message": "Audio file validation failed: Path traversal (..) is not allowed"
  }
}

HTTP Response

  • Status Code: 500
  • Endpoint: POST /transcribe
  • Request: File upload with filename containing ..

Example Failing Filename

This Weird FPV Drone only takes one kind of Battery... Rekon 35 V2.m4a
                                                      ^^^
                                            (Three dots, parsed as "..")

Root Cause Analysis

Current Validation Logic (Problematic)

The API is likely checking for .. anywhere in the filename string, which creates false positives:

# CURRENT (WRONG)
if ".." in filename:
    raise ValidationError("Path traversal (..) is not allowed")

This rejects legitimate filenames like:

  • "video...mp4" (ellipsis in title)
  • "Part 1... Part 2.m4a" (ellipsis separator)
  • "Wait... what.mp4" (dramatic pause)

Actual Security Concern

Path traversal attacks use .. as directory separators to navigate up the filesystem:

  • ../../etc/passwd (DANGEROUS)
  • ../../../secrets.txt (DANGEROUS)
  • video...mp4 (SAFE - just a filename)

Check for .. only when it appears as a complete path component, not as part of the filename text.

import os
from pathlib import Path

def validate_filename(filename: str) -> bool:
    """
    Validate filename for path traversal attacks.

    Returns True if safe, raises ValidationError if dangerous.
    """
    # Normalize the path
    normalized = os.path.normpath(filename)

    # Check if normalization changed the path (indicates traversal)
    if normalized != filename:
        raise ValidationError(f"Path traversal detected: {filename}")

    # Check for absolute paths
    if os.path.isabs(filename):
        raise ValidationError(f"Absolute paths not allowed: {filename}")

    # Split into components and check for parent directory references
    parts = Path(filename).parts
    if ".." in parts:
        raise ValidationError(f"Parent directory references not allowed: {filename}")

    # Check for any path separators (should be basename only)
    if os.sep in filename or (os.altsep and os.altsep in filename):
        raise ValidationError(f"Path separators not allowed: {filename}")

    return True

# Examples:
validate_filename("video.mp4")                    # ✓ PASS
validate_filename("video...mp4")                   # ✓ PASS (ellipsis)
validate_filename("This is... a video.m4a")        # ✓ PASS
validate_filename("../../../etc/passwd")           # ✗ FAIL (traversal)
validate_filename("dir/../file.mp4")               # ✗ FAIL (traversal)
validate_filename("/etc/passwd")                   # ✗ FAIL (absolute)

Option 2: Basename-Only Validation (Simpler)

Only accept basenames (no directory components at all):

import os

def validate_filename(filename: str) -> bool:
    """
    Ensure filename contains no path components.
    """
    # Extract basename
    basename = os.path.basename(filename)

    # Must match original (no path components)
    if basename != filename:
        raise ValidationError(f"Filename must not contain path components: {filename}")

    # Additional check: no path separators
    if "/" in filename or "\\" in filename:
        raise ValidationError(f"Path separators not allowed: {filename}")

    return True

# Examples:
validate_filename("video.mp4")                    # ✓ PASS
validate_filename("video...mp4")                   # ✓ PASS
validate_filename("../file.mp4")                   # ✗ FAIL
validate_filename("dir/file.mp4")                  # ✗ FAIL

Option 3: Regex Pattern Matching (Most Strict)

Use a whitelist approach for allowed characters:

import re

def validate_filename(filename: str) -> bool:
    """
    Validate filename using whitelist of safe characters.
    """
    # Allow: letters, numbers, spaces, dots, hyphens, underscores
    # Length: 1-255 characters
    pattern = r'^[a-zA-Z0-9 .\-_]{1,255}\.[a-zA-Z0-9]{2,10}$'

    if not re.match(pattern, filename):
        raise ValidationError(f"Invalid filename format: {filename}")

    # Additional safety: reject if starts/ends with dot
    if filename.startswith('.') or filename.endswith('.'):
        raise ValidationError(f"Filename cannot start or end with dot: {filename}")

    return True

# Examples:
validate_filename("video.mp4")                    # ✓ PASS
validate_filename("video...mp4")                   # ✓ PASS
validate_filename("My Video... Part 2.m4a")        # ✓ PASS
validate_filename("../file.mp4")                   # ✗ FAIL (starts with ..)
validate_filename("file<>.mp4")                    # ✗ FAIL (invalid chars)

Implementation Steps

1. Locate Current Validation Code

Search for files containing the validation logic:

grep -r "Path traversal" /path/to/transcriptor-api
grep -r '".."' /path/to/transcriptor-api
grep -r "normpath\|basename" /path/to/transcriptor-api

2. Update Validation Function

Replace the current naive check with one of the recommended solutions above.

Priority Order:

  1. Option 1 (Path Component Validation) - Best security/usability balance
  2. Option 2 (Basename-Only) - Simplest, very secure
  3. Option 3 (Regex) - Most restrictive, may reject valid files

3. Test Cases

Create comprehensive test suite:

import pytest

def test_valid_filenames():
    """Test filenames that should be accepted."""
    valid_names = [
        "video.mp4",
        "audio.m4a",
        "This is... a test.mp4",
        "Part 1... Part 2.wav",
        "video...multiple...dots.mp3",
        "My-Video_2024.mp4",
        "song (remix).m4a",
    ]

    for filename in valid_names:
        assert validate_filename(filename), f"Should accept: {filename}"

def test_dangerous_filenames():
    """Test filenames that should be rejected."""
    dangerous_names = [
        "../../../etc/passwd",
        "../../secrets.txt",
        "../file.mp4",
        "/etc/passwd",
        "C:\\Windows\\System32\\file.txt",
        "dir/../file.mp4",
        "file/../../etc/passwd",
    ]

    for filename in dangerous_names:
        with pytest.raises(ValidationError):
            validate_filename(filename)

def test_edge_cases():
    """Test edge cases."""
    edge_cases = [
        (".", False),           # Current directory
        ("..", False),          # Parent directory
        ("...", True),          # Just dots (valid)
        ("....", True),         # Multiple dots (valid)
        (".hidden.mp4", True),  # Hidden file (valid on Unix)
        ("", False),            # Empty string
        ("a" * 256, False),     # Too long
    ]

    for filename, should_pass in edge_cases:
        if should_pass:
            assert validate_filename(filename)
        else:
            with pytest.raises(ValidationError):
                validate_filename(filename)

4. Update Error Response

Provide clearer error messages:

# BAD (current)
{"detail": {"error": "Upload failed", "message": "Audio file validation failed: Path traversal (..) is not allowed"}}

# GOOD (improved)
{
  "detail": {
    "error": "Invalid filename",
    "message": "Filename contains path traversal characters. Please use only the filename without directory paths.",
    "filename": "../../etc/passwd",
    "suggestion": "Use: passwd.txt"
  }
}

Testing the Fix

Manual Testing

  1. Test with problematic filename from bug report:

    curl -X POST http://192.168.1.210:33767/transcribe \
      -F "file=@/path/to/This Weird FPV Drone only takes one kind of Battery... Rekon 35 V2.m4a" \
      -F "model=medium"
    

    Expected: HTTP 200 (success)

  2. Test with actual path traversal:

    curl -X POST http://192.168.1.210:33767/transcribe \
      -F "file=@/tmp/test.m4a;filename=../../etc/passwd" \
      -F "model=medium"
    

    Expected: HTTP 400 (validation error)

  3. Test with various ellipsis patterns:

    • "video...mp4" → Should pass
    • "Part 1... Part 2.m4a" → Should pass
    • "Wait... what!.mp4" → Should pass

Automated Testing

# integration_test.py
import requests

def test_ellipsis_filenames():
    """Test files with ellipsis in names."""
    test_cases = [
        "video...mp4",
        "This is... a test.m4a",
        "Wait... what.mp3",
    ]

    for filename in test_cases:
        response = requests.post(
            "http://192.168.1.210:33767/transcribe",
            files={"file": (filename, open("test_audio.m4a", "rb"))},
            data={"model": "medium"}
        )
        assert response.status_code == 200, f"Failed for: {filename}"

Security Considerations

What We're Protecting Against

  1. Path Traversal: ../../../sensitive/file
  2. Absolute Paths: /etc/passwd or C:\Windows\System32\
  3. Hidden Paths: ./.git/config

What We're NOT Breaking

  1. Ellipsis in titles: "Wait... what.mp4"
  2. Multiple extensions: "file.tar.gz"
  3. Special characters: "My Video (2024).mp4"

Additional Hardening (Optional)

def sanitize_and_validate_filename(filename: str) -> str:
    """
    Sanitize filename and validate for safety.
    Returns cleaned filename or raises error.
    """
    # Remove null bytes
    filename = filename.replace("\0", "")

    # Extract basename (strips any path components)
    filename = os.path.basename(filename)

    # Limit length
    max_length = 255
    if len(filename) > max_length:
        name, ext = os.path.splitext(filename)
        filename = name[:max_length-len(ext)] + ext

    # Validate
    validate_filename(filename)

    return filename

Deployment Checklist

  • Update validation function with recommended fix
  • Add comprehensive test suite
  • Test with real-world filenames (including bug report case)
  • Test security: attempt path traversal attacks
  • Update API documentation
  • Review error messages for clarity
  • Deploy to staging environment
  • Run integration tests
  • Monitor logs for validation failures
  • Deploy to production
  • Verify bug reporter's file now works

Contact & Context

Bug Report Date: 2025-10-26 Affected Endpoint: POST /transcribe Error Code: HTTP 500 Client Application: yt-dlp-webui v3

Example Failing Request:

POST http://192.168.1.210:33767/transcribe
Content-Type: multipart/form-data

file: "This Weird FPV Drone only takes one kind of Battery... Rekon 35 V2.m4a"
model: "medium"

Current Behavior: Returns 500 error with path traversal message Expected Behavior: Accepts file and processes transcription


Quick Reference

Files to Check

  • /path/to/api/validators.py or similar
  • /path/to/api/upload_handler.py
  • /path/to/api/routes/transcribe.py

Search Commands

# Find validation code
rg "Path traversal" --type py
rg '"\.\."' --type py
rg "ValidationError.*filename" --type py

# Find upload handlers
rg "def.*upload|def.*transcribe" --type py

Priority Fix

Use Option 1 (Path Component Validation) - it provides the best balance of security and usability.