mirror of https://github.com/PierrunoYT/Kokoro-TTS-Local.git synced 2025-01-27 02:30:25 +03:00

Pierre Bruno f7753ccb62 Enhance Gradio interface and audio conversion capabilities

- Added audio format conversion functionality using pydub, supporting WAV, MP3, and AAC formats.
- Improved error handling for voice directory access and audio conversion processes.
- Updated README to reflect new web interface features and installation requirements, including FFmpeg.
- Enhanced the TTS generation function to utilize the correct Python interpreter across platforms.
- Documented new features in the README, including real-time progress monitoring and network sharing capabilities.

2025-01-16 16:19:31 +01:00

5.4 KiB

Kokoro TTS Local

A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.

Current Status

✅ WORKING - READY TO USE ✅

The project has been updated with:

Automatic espeak-ng installation and configuration
Dynamic module loading from Hugging Face
Improved error handling and debugging
Interactive CLI interface
Cross-platform setup scripts
Web interface with Gradio

Features

Local text-to-speech synthesis using the Kokoro model
Automatic espeak-ng setup using espeakng-loader
Multiple voice support with easy voice selection
Phoneme output support and visualization
Interactive CLI for custom text input
Voice listing functionality
Dynamic module loading from Hugging Face
Comprehensive error handling and logging
Cross-platform support (Windows, Linux, macOS)
NEW: Web Interface Features
- Modern, user-friendly UI
- Real-time generation progress
- Multiple output formats (WAV, MP3, AAC)
- Network sharing capabilities
- Audio playback and download
- Voice selection dropdown
- Detailed process logging

Prerequisites

Python 3.8 or higher
Git (for cloning the repository)
Internet connection (for initial model download)
FFmpeg (required for MP3/AAC conversion):
- Windows: install FFmpeg separately and add it to PATH (pydub calls FFmpeg but does not bundle it)
- Linux: sudo apt-get install ffmpeg
- macOS: brew install ffmpeg
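
Before running setup, the prerequisites above can be checked from Python. This is a minimal sketch using only the standard library; the function name check_prerequisites is hypothetical and not part of this project. It only tests whether the tools are on PATH, not whether their versions are suitable.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # Compare the running interpreter against the minimum version.
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing ffmpeg entry only matters for MP3/AAC output; WAV output does not need it.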

Dependencies

torch
phonemizer-fork
transformers
scipy
munch
soundfile
huggingface-hub
espeakng-loader
gradio>=4.0.0
pydub  # For audio format conversion

Setup

Windows

# Clone the repository
git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
cd Kokoro-TTS-Local

# Run the setup script
.\setup.ps1

Linux/macOS

# Clone the repository
git clone https://github.com/PierrunoYT/Kokoro-TTS-Local.git
cd Kokoro-TTS-Local

# Run the setup script
chmod +x setup.sh
./setup.sh

# Install FFmpeg (if needed)
# Linux:
sudo apt-get install ffmpeg
# macOS:
brew install ffmpeg

Manual Setup

If you prefer to set up manually:

Create a virtual environment:

# Windows
python -m venv venv
.\venv\Scripts\activate

# Linux/macOS
python3 -m venv venv
source venv/bin/activate

Install dependencies:

python -m pip install --upgrade pip
pip install -r requirements.txt

Install system dependencies:

# Windows
# pydub does not bundle FFmpeg; install FFmpeg separately and add it to PATH

# Linux
sudo apt-get update
sudo apt-get install espeak-ng ffmpeg

# macOS
brew install espeak ffmpeg

Usage

Web Interface

# Start the web interface
python gradio_interface.py

This will:

Launch a web interface at http://localhost:7860
Create a public share link (optional)
Allow you to:
- Input text to synthesize
- Select from available voices
- Choose output format (WAV/MP3/AAC)
- Monitor generation progress
- Play or download generated audio
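
The output format choice can be illustrated with pydub. The sketch below is not the code in gradio_interface.py; convert_audio and the FORMATS table are hypothetical names, and mapping AAC to FFmpeg's "adts" container is an assumption. pydub is imported lazily so that WAV output works even without pydub or FFmpeg installed.

```python
from pathlib import Path

# UI format name -> (file extension, container name passed to FFmpeg).
# "adts" as the container for raw AAC is an assumption of this sketch.
FORMATS = {
    "wav": (".wav", "wav"),
    "mp3": (".mp3", "mp3"),
    "aac": (".aac", "adts"),
}


def convert_audio(wav_path, fmt):
    """Convert a WAV file to the requested format and return the new path."""
    fmt = fmt.lower()
    if fmt not in FORMATS:
        raise ValueError(f"Unsupported format: {fmt}")
    src = Path(wav_path)
    if fmt == "wav":
        return src  # already in the requested format, nothing to do
    ext, container = FORMATS[fmt]
    # Imported here because pydub needs FFmpeg on PATH for MP3/AAC.
    from pydub import AudioSegment
    dst = src.with_suffix(ext)
    AudioSegment.from_wav(str(src)).export(str(dst), format=container)
    return dst
```

For example, convert_audio("output.wav", "MP3") would write output.mp3 next to the source file, provided FFmpeg is installed.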

Command Line Interface

python tts_demo.py

The script will:

Download necessary model files from Hugging Face
Set up espeak-ng automatically using espeakng-loader
Import required modules dynamically
Test the phonemizer functionality
Generate speech from your text with phoneme visualization
Save the output as 'output.wav' (22050Hz sample rate)
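
To confirm the sample rate of the generated file, the WAV header can be read with Python's standard wave module. This is a small sketch; wav_info is a hypothetical helper, and it assumes the file is PCM-encoded (the wave module does not read floating-point WAV files).

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        # Duration is the frame count divided by frames per second.
        return rate, wav.getnchannels(), wav.getnframes() / rate


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(wav_info(sys.argv[1]))
```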

Project Structure

models.py: Core model loading and speech generation functionality
- Model building and initialization with dynamic imports
- Voice loading and management from Hugging Face
- Speech generation with phoneme output
- Voice listing functionality
- Automatic espeak-ng configuration
- Error handling and logging
tts_demo.py: Demo script showing basic usage
- Command-line interface with argparse
- Interactive text input mode
- Voice selection and listing
- Error handling and user feedback
gradio_interface.py: Web interface implementation
- Modern, responsive UI
- Real-time progress monitoring
- Multiple output formats
- Network sharing capabilities
setup.ps1: Windows PowerShell setup script
- Environment creation
- Dependency installation
- Automatic configuration
setup.sh: Linux/macOS bash setup script
- Environment creation
- Dependency installation
- Automatic configuration
requirements.txt: Project dependencies

Model Information

The project uses the Kokoro-82M model from Hugging Face:

Repository: hexgrad/Kokoro-82M
Model file: kokoro-v0_19.pth
Voice files: Located in the voices/ directory
Supports multiple voice styles (use --list-voices to see available options)
Automatically downloads required files from Hugging Face
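
The --list-voices option can also be invoked from another Python program. The sketch below assumes the flag belongs to tts_demo.py and makes no assumption about the format of its output; the function names are hypothetical. Using sys.executable keeps the call inside the active virtual environment on every platform.

```python
import subprocess
import sys


def list_voices_command(script="tts_demo.py"):
    """Build the voice-listing command using the current interpreter."""
    return [sys.executable, script, "--list-voices"]


def list_voices(script="tts_demo.py"):
    """Run the command and return its raw standard output as text."""
    result = subprocess.run(
        list_voices_command(script),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```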

Technical Details

Sample rate: 22050Hz
Input: Text in any language (English recommended)
Output: WAV/MP3/AAC audio file
Dependencies are automatically managed
Modules are dynamically loaded from Hugging Face
Error handling includes stack traces for debugging
Cross-platform compatibility through setup scripts
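
Dynamic module loading can be sketched with importlib and huggingface_hub. This is an illustration of the general technique, not the code in models.py: load_module_from_path and load_from_hub are hypothetical names, and which Python files the model repository contains is not stated here, so the filename is left to the caller.

```python
import importlib.util
import sys


def load_module_from_path(name, path):
    """Import the Python file at `path` as a module called `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {name} from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register so later imports can find it
    spec.loader.exec_module(module)
    return module


def load_from_hub(filename, repo_id="hexgrad/Kokoro-82M"):
    """Download one Python file from the Hub and import it (needs network)."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return load_module_from_path(filename.rsplit(".", 1)[0], path)
```

Loading a module this way executes the downloaded code, so it should only be used with repositories you trust.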

Contributing

Feel free to contribute by:

Opening issues for bugs or feature requests
Submitting pull requests with improvements
Helping with documentation
Testing different voices and reporting issues
Suggesting new features or optimizations
Testing on different platforms and reporting results

License

This project is licensed under the Apache 2.0 License.
