Mirror of https://github.com/PierrunoYT/Kokoro-TTS-Local.git, synced 2025-01-27 02:30:25 +03:00
Enhance Gradio interface and audio conversion capabilities
- Added audio format conversion functionality using pydub, supporting WAV, MP3, and AAC formats.
- Improved error handling for voice directory access and audio conversion processes.
- Updated README to reflect new web interface features and installation requirements, including FFmpeg.
- Enhanced the TTS generation function to utilize the correct Python interpreter across platforms.
- Documented new features in the README, including real-time progress monitoring and network sharing capabilities.
91
README.md
@@ -1,6 +1,6 @@
 # Kokoro TTS Local
 
-A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading and automatic dependency management.
+A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.
 
 ## Current Status
 
@@ -12,6 +12,7 @@ The project has been updated with:
 - Improved error handling and debugging
 - Interactive CLI interface
 - Cross-platform setup scripts
+- Web interface with Gradio
 
 ## Features
 
@@ -24,12 +25,24 @@ The project has been updated with:
 - Dynamic module loading from Hugging Face
 - Comprehensive error handling and logging
 - Cross-platform support (Windows, Linux, macOS)
+- **NEW: Web Interface Features**
+  - Modern, user-friendly UI
+  - Real-time generation progress
+  - Multiple output formats (WAV, MP3, AAC)
+  - Network sharing capabilities
+  - Audio playback and download
+  - Voice selection dropdown
+  - Detailed process logging
 
 ## Prerequisites
 
 - Python 3.8 or higher
 - Git (for cloning the repository)
 - Internet connection (for initial model download)
+- FFmpeg (required for MP3/AAC conversion):
+  - Windows: Automatically installed with pydub
+  - Linux: `sudo apt-get install ffmpeg`
+  - macOS: `brew install ffmpeg`
 
 ## Dependencies
 
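Review note on the FFmpeg prerequisite added to the README: as far as I know, pydub does not bundle an FFmpeg binary on any platform. It calls an `ffmpeg` (or `avconv`) executable that it expects to find on `PATH`, so the "Windows: Automatically installed with pydub" line may leave Windows users without MP3/AAC export. A minimal sketch of a startup check follows; `find_ffmpeg` is a hypothetical helper name and is not part of this commit.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the path of an ffmpeg-compatible binary on PATH, or None.

    Hypothetical helper for illustration; not part of this commit.
    """
    for name in ("ffmpeg", "avconv"):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    binary = find_ffmpeg()
    if binary is None:
        # MP3/AAC export goes through FFmpeg, so warn early.
        print("FFmpeg not found on PATH: MP3/AAC export will likely fail.")
    else:
        print(f"Using {binary}")
```

Running such a check once when the web interface starts would turn a late conversion failure into an early, readable warning.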
@@ -42,21 +55,37 @@ munch
 soundfile
 huggingface-hub
 espeakng-loader
+gradio>=4.0.0
+pydub # For audio format conversion
 ```
 
 ## Setup
 
 ### Windows
-Run the PowerShell setup script:
 ```powershell
+# Clone the repository
+git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
+cd Kokoro-TTS-Local
+
+# Run the setup script
 .\setup.ps1
 ```
 
 ### Linux/macOS
-Run the bash setup script:
 ```bash
+# Clone the repository
+git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
+cd Kokoro-TTS-Local
+
+# Run the setup script
 chmod +x setup.sh
 ./setup.sh
+
+# Install FFmpeg (if needed)
+# Linux:
+sudo apt-get install ffmpeg
+# macOS:
+brew install ffmpeg
 ```
 
 ### Manual Setup
@@ -79,34 +108,37 @@ python -m pip install --upgrade pip
 pip install -r requirements.txt
 ```
 
+3. Install system dependencies:
+```bash
+# Windows
+# FFmpeg is automatically installed with pydub
+
+# Linux
+sudo apt-get update
+sudo apt-get install espeak-ng ffmpeg
+
+# macOS
+brew install espeak ffmpeg
+```
+
 ## Usage
 
-### List Available Voices
-To see all available voices from the Hugging Face repository:
+### Web Interface
 ```bash
-python tts_demo.py --list-voices
+# Start the web interface
+python gradio_interface.py
 ```
+This will:
+1. Launch a web interface at http://localhost:7860
+2. Create a public share link (optional)
+3. Allow you to:
+   - Input text to synthesize
+   - Select from available voices
+   - Choose output format (WAV/MP3/AAC)
+   - Monitor generation progress
+   - Play or download generated audio
 
-### Basic Usage
-Run the demo script with default text and voice:
-```bash
-python tts_demo.py
-```
-
-### Custom Text
-Specify your own text:
-```bash
-python tts_demo.py --text "Your custom text here"
-```
-
-### Voice Selection
-Choose a different voice (use --list-voices to see available options):
-```bash
-python tts_demo.py --voice "af" --text "Custom text with specific voice"
-```
-
-### Interactive Mode
-If you run without any arguments, you'll be prompted to enter text interactively:
+### Command Line Interface
 ```bash
 python tts_demo.py
 ```
@@ -133,6 +165,11 @@ The script will:
   - Interactive text input mode
   - Voice selection and listing
   - Error handling and user feedback
+- `gradio_interface.py`: Web interface implementation
+  - Modern, responsive UI
+  - Real-time progress monitoring
+  - Multiple output formats
+  - Network sharing capabilities
 - `setup.ps1`: Windows PowerShell setup script
   - Environment creation
   - Dependency installation
@@ -156,7 +193,7 @@ The project uses the Kokoro-82M model from Hugging Face:
 
 - Sample rate: 22050Hz
 - Input: Text in any language (English recommended)
-- Output: WAV audio file
+- Output: WAV/MP3/AAC audio file
 - Dependencies are automatically managed
 - Modules are dynamically loaded from Hugging Face
 - Error handling includes stack traces for debugging
@@ -14,18 +14,20 @@ Key Features:
 Dependencies:
 - gradio: Web interface framework
 - soundfile: Audio file handling
+- pydub: Audio format conversion
 - models: Custom module for voice model management
 """
 
 import gradio as gr
 import subprocess
 import os
+import sys
 import platform
 from datetime import datetime
 import shutil
-import json
-import soundfile as sf
 from pathlib import Path
+import soundfile as sf
+from pydub import AudioSegment
 
 # Global configuration
 CONFIG_FILE = "tts_config.json" # Stores user preferences and paths
@@ -42,51 +44,52 @@ def get_default_voices_path():
 
 def get_available_voices():
     """Get list of available voice models by checking the directory."""
-    voices_path = get_default_voices_path() # Use platform-agnostic path
+    voices_path = get_default_voices_path()
     try:
-        # List all files in the directory and filter by .pt extension
+        if not os.path.exists(voices_path):
+            print(f"Voices directory not found: {voices_path}")
+            return []
         voices = [os.path.splitext(f)[0] for f in os.listdir(voices_path) if f.endswith('.pt')]
-        print("Available voices:", voices) # Debugging log
+        print("Available voices:", voices)
         return voices
     except Exception as e:
         print(f"Error retrieving voices: {e}")
-        return [] # Return an empty list if there's an error
+        return []
 
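Review note on `get_available_voices()`: the directory check plus `.pt` filter could also be written with `pathlib`, which the module already imports. The sketch below is an illustration, not the author's code. `list_voices` is a hypothetical name, and it takes the directory as a parameter (instead of calling `get_default_voices_path()`) so it can be exercised in isolation. It also sorts the result, which the original does not do.

```python
from pathlib import Path
from typing import List


def list_voices(voices_dir: str) -> List[str]:
    """Return sorted voice names (file stems) for every *.pt file in voices_dir.

    Hypothetical stand-in for get_available_voices(); returns [] when the
    directory is missing, mirroring the behavior added in this commit.
    """
    root = Path(voices_dir)
    if not root.is_dir():
        print(f"Voices directory not found: {root}")
        return []
    # glob("*.pt") matches only direct children ending in .pt
    return sorted(p.stem for p in root.glob("*.pt") if p.is_file())
```

Sorting keeps the dropdown order stable across platforms, since directory listing order is not guaranteed.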
+def convert_audio(input_path: str, output_path: str, format: str):
+    """Convert audio to specified format using pydub."""
+    try:
+        audio = AudioSegment.from_wav(input_path)
+        if format == "mp3":
+            audio.export(output_path, format="mp3", bitrate="192k")
+        elif format == "aac":
+            audio.export(output_path, format="aac", bitrate="192k")
+        else: # wav
+            shutil.copy2(input_path, output_path)
+        return True
+    except Exception as e:
+        print(f"Error converting audio: {e}")
+        return False
+
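Review note on `convert_audio()`: two things may be worth checking. First, pydub passes `format` straight to FFmpeg as the output container, and to my knowledge FFmpeg's muxer for raw AAC is named `adts`, so `format="aac"` may be rejected by some FFmpeg builds; this should be verified against the FFmpeg version in use. Second, the `else` branch treats every unrecognized format as WAV and copies the file unchanged. The sketch below shows one way to centralize the mapping; `EXPORT_KWARGS` and `export_kwargs` are hypothetical names, and the `adts` choice is an assumption rather than the commit's behavior.

```python
from typing import Dict

# Assumed mapping from the UI's format names to keyword arguments for
# pydub's AudioSegment.export(). "adts" is FFmpeg's muxer for raw AAC.
EXPORT_KWARGS: Dict[str, Dict[str, str]] = {
    "mp3": {"format": "mp3", "bitrate": "192k"},
    "aac": {"format": "adts", "bitrate": "192k"},
}


def export_kwargs(fmt: str) -> Dict[str, str]:
    """Return export() kwargs for fmt, or raise ValueError if unsupported."""
    key = fmt.lower()
    if key == "wav":
        return {"format": "wav"}
    try:
        # Copy so callers cannot mutate the shared table.
        return dict(EXPORT_KWARGS[key])
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
```

With this helper the export call would become `audio.export(output_path, **export_kwargs(format))`, and an unknown format would surface as an error instead of a mislabeled WAV copy.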
 def generate_tts_with_logs(voice, text, format):
-    """Generate TTS audio with real-time logging and format conversion.
-
-    This function:
-    1. Validates input text
-    2. Runs TTS generation subprocess
-    3. Streams progress logs in real-time
-    4. Converts output to requested format
-    5. Saves with timestamp in output directory
-
-    Args:
-        voice (str): Selected voice model identifier (e.g., "af", "af_bella")
-        text (str): Input text to synthesize
-        format (str): Output audio format ("wav", "mp3", or "aac")
-
-    Yields:
-        tuple: (log_text, output_path)
-            - log_text (str): Accumulated process logs
-            - output_path (str): Path to generated audio file, or None on error
-
-    Notes:
-        - Temporary WAV file is created and deleted after conversion
-        - Output filename includes timestamp to prevent overwrites
-        - Errors are caught and reported in logs
-    """
+    """Generate TTS audio with real-time logging and format conversion."""
     if not text.strip():
         return "❌ Error: Text required", None
 
     logs_text = ""
     try:
+        # Use sys.executable to ensure correct Python interpreter
+        cmd = [sys.executable, "tts_demo.py", "--text", text, "--voice", voice]
+
+        # Use shell=True on Windows
+        shell = platform.system().lower() == "windows"
+
         process = subprocess.Popen(
-            ["python", "tts_demo.py", "--text", text, "--voice", voice],
+            cmd,
             stdout=subprocess.PIPE,
             stderr=subprocess.STDOUT,
-            universal_newlines=True
+            universal_newlines=True,
+            shell=shell
         )
 
         while True:
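Review note on the `subprocess.Popen` change: because `cmd` is already an argument list that starts with `sys.executable`, a shell should not be needed to launch it on Windows, and routing user-supplied text through `cmd.exe` may expose it to shell metacharacter handling. The sketch below shows the same streaming pattern without `shell=True`. `stream_command` is a hypothetical helper, and the demo command is a stand-in for the `tts_demo.py` invocation.

```python
import subprocess
import sys
from typing import Iterator, List


def stream_command(argv: List[str]) -> Iterator[str]:
    """Run argv without a shell and yield merged stdout/stderr line by line."""
    process = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,  # same effect as universal_newlines=True
    )
    assert process.stdout is not None  # guaranteed because stdout=PIPE
    with process.stdout:
        for line in process.stdout:
            yield line.rstrip("\n")
    return_code = process.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, argv)


if __name__ == "__main__":
    # Stand-in for [sys.executable, "tts_demo.py", "--text", text, "--voice", voice]
    for line in stream_command([sys.executable, "-c", "print('hello & goodbye')"]):
        print(line)
```

Raising on a non-zero exit code also lets the caller report a failed generation instead of looking for an `output.wav` that was never written.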
@@ -112,15 +115,14 @@ def generate_tts_with_logs(voice, text, format):
         os.makedirs(DEFAULT_OUTPUT_DIR, exist_ok=True)
         output_path = Path(DEFAULT_OUTPUT_DIR) / filename
 
-        if format == "wav":
-            shutil.copy2("output.wav", output_path)
+        # Convert audio using pydub
+        if convert_audio("output.wav", str(output_path), format):
+            logs_text += f"✅ Saved: {output_path}\n"
+            os.remove("output.wav")
+            yield logs_text, str(output_path)
         else:
-            data, samplerate = sf.read("output.wav")
-            sf.write(str(output_path), data, samplerate)
+            logs_text += "❌ Audio conversion failed\n"
+            yield logs_text, None
 
-        os.remove("output.wav")
-        logs_text += f"✅ Saved: {output_path}\n"
-        yield logs_text, str(output_path)
-
     except Exception as e:
         logs_text += f"❌ Error: {str(e)}\n"
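Review note on `generate_tts_with_logs()`: the function contains `yield`, so Python treats it as a generator. In a generator, `return "❌ Error: Text required", None` ends iteration without delivering that tuple to a consumer that simply iterates, so the empty-text message probably never reaches the UI. The sketch below shows the usual yield-then-return pattern. `generate_with_validation` is a hypothetical, trimmed-down stand-in, and `"output.wav"` is a placeholder path.

```python
from typing import Iterator, Optional, Tuple


def generate_with_validation(text: str) -> Iterator[Tuple[str, Optional[str]]]:
    """Hypothetical, trimmed-down stand-in for generate_tts_with_logs()."""
    if not text.strip():
        # Yield the message so the consumer sees it, then stop the generator.
        yield "❌ Error: Text required", None
        return
    logs_text = f"Synthesizing {len(text)} characters...\n"
    yield logs_text, None
    # ... TTS generation and format conversion would happen here ...
    logs_text += "✅ Done\n"
    yield logs_text, "output.wav"  # placeholder path for illustration
```

Iterating `generate_with_validation("   ")` produces exactly one `(message, None)` pair, which is what a streaming UI callback needs in order to display the error.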
@@ -5,4 +5,6 @@ scipy
 munch
 soundfile
 huggingface-hub
 espeakng-loader
+gradio>=4.0.0
+pydub # For audio format conversion