Enhance Gradio interface and audio conversion capabilities

- Added audio format conversion functionality using pydub, supporting WAV, MP3, and AAC formats.
- Improved error handling for voice directory access and audio conversion processes.
- Updated README to reflect new web interface features and installation requirements, including FFmpeg.
- Enhanced the TTS generation function to utilize the correct Python interpreter across platforms.
- Documented new features in the README, including real-time progress monitoring and network sharing capabilities.
This commit is contained in:
Pierre Bruno
2025-01-16 16:19:31 +01:00
parent 49e19f0c51
commit f7753ccb62
3 changed files with 109 additions and 68 deletions

View File

@@ -1,6 +1,6 @@
# Kokoro TTS Local # Kokoro TTS Local
A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading and automatic dependency management. A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.
## Current Status ## Current Status
@@ -12,6 +12,7 @@ The project has been updated with:
- Improved error handling and debugging - Improved error handling and debugging
- Interactive CLI interface - Interactive CLI interface
- Cross-platform setup scripts - Cross-platform setup scripts
- Web interface with Gradio
## Features ## Features
@@ -24,12 +25,24 @@ The project has been updated with:
- Dynamic module loading from Hugging Face - Dynamic module loading from Hugging Face
- Comprehensive error handling and logging - Comprehensive error handling and logging
- Cross-platform support (Windows, Linux, macOS) - Cross-platform support (Windows, Linux, macOS)
- **NEW: Web Interface Features**
- Modern, user-friendly UI
- Real-time generation progress
- Multiple output formats (WAV, MP3, AAC)
- Network sharing capabilities
- Audio playback and download
- Voice selection dropdown
- Detailed process logging
## Prerequisites ## Prerequisites
- Python 3.8 or higher - Python 3.8 or higher
- Git (for cloning the repository) - Git (for cloning the repository)
- Internet connection (for initial model download) - Internet connection (for initial model download)
- FFmpeg (required for MP3/AAC conversion):
- Windows: Automatically installed with pydub
- Linux: `sudo apt-get install ffmpeg`
- macOS: `brew install ffmpeg`
## Dependencies ## Dependencies
@@ -42,21 +55,37 @@ munch
soundfile soundfile
huggingface-hub huggingface-hub
espeakng-loader espeakng-loader
gradio>=4.0.0
pydub # For audio format conversion
``` ```
## Setup ## Setup
### Windows ### Windows
Run the PowerShell setup script:
```powershell ```powershell
# Clone the repository
git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
cd Kokoro-TTS-Local
# Run the setup script
.\setup.ps1 .\setup.ps1
``` ```
### Linux/macOS ### Linux/macOS
Run the bash setup script:
```bash ```bash
# Clone the repository
git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
cd Kokoro-TTS-Local
# Run the setup script
chmod +x setup.sh chmod +x setup.sh
./setup.sh ./setup.sh
# Install FFmpeg (if needed)
# Linux:
sudo apt-get install ffmpeg
# macOS:
brew install ffmpeg
``` ```
### Manual Setup ### Manual Setup
@@ -79,34 +108,37 @@ python -m pip install --upgrade pip
pip install -r requirements.txt pip install -r requirements.txt
``` ```
3. Install system dependencies:
```bash
# Windows
# FFmpeg is automatically installed with pydub
# Linux
sudo apt-get update
sudo apt-get install espeak-ng ffmpeg
# macOS
brew install espeak ffmpeg
```
## Usage ## Usage
### List Available Voices ### Web Interface
To see all available voices from the Hugging Face repository:
```bash ```bash
python tts_demo.py --list-voices # Start the web interface
python gradio_interface.py
``` ```
This will:
1. Launch a web interface at http://localhost:7860
2. Create a public share link (optional)
3. Allow you to:
- Input text to synthesize
- Select from available voices
- Choose output format (WAV/MP3/AAC)
- Monitor generation progress
- Play or download generated audio
### Basic Usage ### Command Line Interface
Run the demo script with default text and voice:
```bash
python tts_demo.py
```
### Custom Text
Specify your own text:
```bash
python tts_demo.py --text "Your custom text here"
```
### Voice Selection
Choose a different voice (use --list-voices to see available options):
```bash
python tts_demo.py --voice "af" --text "Custom text with specific voice"
```
### Interactive Mode
If you run without any arguments, you'll be prompted to enter text interactively:
```bash ```bash
python tts_demo.py python tts_demo.py
``` ```
@@ -133,6 +165,11 @@ The script will:
- Interactive text input mode - Interactive text input mode
- Voice selection and listing - Voice selection and listing
- Error handling and user feedback - Error handling and user feedback
- `gradio_interface.py`: Web interface implementation
- Modern, responsive UI
- Real-time progress monitoring
- Multiple output formats
- Network sharing capabilities
- `setup.ps1`: Windows PowerShell setup script - `setup.ps1`: Windows PowerShell setup script
- Environment creation - Environment creation
- Dependency installation - Dependency installation
@@ -156,7 +193,7 @@ The project uses the Kokoro-82M model from Hugging Face:
- Sample rate: 22050Hz - Sample rate: 22050Hz
- Input: Text in any language (English recommended) - Input: Text in any language (English recommended)
- Output: WAV audio file - Output: WAV/MP3/AAC audio file
- Dependencies are automatically managed - Dependencies are automatically managed
- Modules are dynamically loaded from Hugging Face - Modules are dynamically loaded from Hugging Face
- Error handling includes stack traces for debugging - Error handling includes stack traces for debugging

View File

@@ -14,18 +14,20 @@ Key Features:
Dependencies: Dependencies:
- gradio: Web interface framework - gradio: Web interface framework
- soundfile: Audio file handling - soundfile: Audio file handling
- pydub: Audio format conversion
- models: Custom module for voice model management - models: Custom module for voice model management
""" """
import gradio as gr import gradio as gr
import subprocess import subprocess
import os import os
import sys
import platform import platform
from datetime import datetime from datetime import datetime
import shutil import shutil
import json
import soundfile as sf
from pathlib import Path from pathlib import Path
import soundfile as sf
from pydub import AudioSegment
# Global configuration # Global configuration
CONFIG_FILE = "tts_config.json" # Stores user preferences and paths CONFIG_FILE = "tts_config.json" # Stores user preferences and paths
@@ -42,51 +44,52 @@ def get_default_voices_path():
def get_available_voices(): def get_available_voices():
"""Get list of available voice models by checking the directory.""" """Get list of available voice models by checking the directory."""
voices_path = get_default_voices_path() # Use platform-agnostic path voices_path = get_default_voices_path()
try: try:
# List all files in the directory and filter by .pt extension if not os.path.exists(voices_path):
print(f"Voices directory not found: {voices_path}")
return []
voices = [os.path.splitext(f)[0] for f in os.listdir(voices_path) if f.endswith('.pt')] voices = [os.path.splitext(f)[0] for f in os.listdir(voices_path) if f.endswith('.pt')]
print("Available voices:", voices) # Debugging log print("Available voices:", voices)
return voices return voices
except Exception as e: except Exception as e:
print(f"Error retrieving voices: {e}") print(f"Error retrieving voices: {e}")
return [] # Return an empty list if there's an error return []
def convert_audio(input_path: str, output_path: str, format: str):
"""Convert audio to specified format using pydub."""
try:
audio = AudioSegment.from_wav(input_path)
if format == "mp3":
audio.export(output_path, format="mp3", bitrate="192k")
elif format == "aac":
audio.export(output_path, format="aac", bitrate="192k")
else: # wav
shutil.copy2(input_path, output_path)
return True
except Exception as e:
print(f"Error converting audio: {e}")
return False
def generate_tts_with_logs(voice, text, format): def generate_tts_with_logs(voice, text, format):
"""Generate TTS audio with real-time logging and format conversion. """Generate TTS audio with real-time logging and format conversion."""
This function:
1. Validates input text
2. Runs TTS generation subprocess
3. Streams progress logs in real-time
4. Converts output to requested format
5. Saves with timestamp in output directory
Args:
voice (str): Selected voice model identifier (e.g., "af", "af_bella")
text (str): Input text to synthesize
format (str): Output audio format ("wav", "mp3", or "aac")
Yields:
tuple: (log_text, output_path)
- log_text (str): Accumulated process logs
- output_path (str): Path to generated audio file, or None on error
Notes:
- Temporary WAV file is created and deleted after conversion
- Output filename includes timestamp to prevent overwrites
- Errors are caught and reported in logs
"""
if not text.strip(): if not text.strip():
return "❌ Error: Text required", None return "❌ Error: Text required", None
logs_text = "" logs_text = ""
try: try:
# Use sys.executable to ensure correct Python interpreter
cmd = [sys.executable, "tts_demo.py", "--text", text, "--voice", voice]
# Use shell=True on Windows
shell = platform.system().lower() == "windows"
process = subprocess.Popen( process = subprocess.Popen(
["python", "tts_demo.py", "--text", text, "--voice", voice], cmd,
stdout=subprocess.PIPE, stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, stderr=subprocess.STDOUT,
universal_newlines=True universal_newlines=True,
shell=shell
) )
while True: while True:
@@ -112,15 +115,14 @@ def generate_tts_with_logs(voice, text, format):
os.makedirs(DEFAULT_OUTPUT_DIR, exist_ok=True) os.makedirs(DEFAULT_OUTPUT_DIR, exist_ok=True)
output_path = Path(DEFAULT_OUTPUT_DIR) / filename output_path = Path(DEFAULT_OUTPUT_DIR) / filename
if format == "wav": # Convert audio using pydub
shutil.copy2("output.wav", output_path) if convert_audio("output.wav", str(output_path), format):
logs_text += f"✅ Saved: {output_path}\n"
os.remove("output.wav")
yield logs_text, str(output_path)
else: else:
data, samplerate = sf.read("output.wav") logs_text += "❌ Audio conversion failed\n"
sf.write(str(output_path), data, samplerate) yield logs_text, None
os.remove("output.wav")
logs_text += f"✅ Saved: {output_path}\n"
yield logs_text, str(output_path)
except Exception as e: except Exception as e:
logs_text += f"❌ Error: {str(e)}\n" logs_text += f"❌ Error: {str(e)}\n"

View File

@@ -5,4 +5,6 @@ scipy
munch munch
soundfile soundfile
huggingface-hub huggingface-hub
espeakng-loader espeakng-loader
gradio>=4.0.0
pydub # For audio format conversion