HuggingFace 🤗: [Model](https://huggingface.co/neuphonic/neutts-air), [Q8 GGUF](https://huggingface.co/neuphonic/neutts-air-q8-gguf), [Q4 GGUF](https://huggingface.co/neuphonic/neutts-air-q4-gguf), [Spaces](https://huggingface.co/spaces/neuphonic/neutts-air)

<a href="https://www.youtube.com/watch?v=YAB3hCtu5wE"><img width="1920" height="1080" alt="image" src="https://github.com/user-attachments/assets/ec8efcaf-ef79-4c16-b549-ddebc2256c2f" /></a>

Click the image above to watch NeuTTS Air in action on YouTube!

[Demo Video](https://github.com/user-attachments/assets/020547bc-9e3e-440f-b016-ae61ca645184)

*Created by [Neuphonic](http://neuphonic.com/) - building faster, smaller, on-device voice AI*

State-of-the-art Voice AI has been locked behind web APIs for too long.

## Model Details

NeuTTS Air is built on Qwen 0.5B - a lightweight yet capable language model optimised for text understanding and generation - together with a powerful combination of technologies designed for efficiency and quality:
- **Supported Languages**: English
- **Audio Codec**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) - our 50 Hz neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook
- **Context Window**: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)
- **Format**: Available in GGML format for efficient on-device inference
- **Responsibility**: Watermarked outputs
- **Inference Speed**: Real-time generation on mid-range devices
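
As a rough sanity check on that context window: the 50 Hz codec above spends about 50 tokens per second of audio, so ~30 seconds of audio plus a text prompt roughly fills the 2048-token window. A back-of-envelope sketch (the prompt allowance is an assumed figure for illustration, not a measured one):

```python
# Rough token budget for the 2048-token context window
CONTEXT_TOKENS = 2048
CODEC_TOKENS_PER_SECOND = 50  # NeuCodec runs at 50 Hz with a single codebook
PROMPT_TOKENS = 500           # assumed allowance for input text + reference transcript

audio_seconds = (CONTEXT_TOKENS - PROMPT_TOKENS) / CODEC_TOKENS_PER_SECOND
print(f"~{audio_seconds:.0f}s of audio fit alongside the prompt")  # ~31s
```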
To use the onnx decoder, install `onnxruntime`:

```bash
pip install onnxruntime
```
## Running the Model
Run the basic example script to synthesize speech:
```bash
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio ./samples/dave.wav \
  --ref_text ./samples/dave.txt
```

To specify a particular model repo for the backbone or codec, add the `--backbone` flag.

Several examples are available, including a Jupyter notebook in the `examples` folder.
### One-Code Block Usage
```python
from neuttsair.neutts import NeuTTSAir
import soundfile as sf

tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",  # or "neuphonic/neutts-air-q4-gguf" with llama-cpp-python installed
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu"
)
input_text = "My name is Dave, and um, I'm from London."

ref_text = "samples/dave.txt"
ref_audio = "samples/dave.wav"

# Load the reference transcript that matches the reference audio
with open(ref_text, "r") as f:
    ref_text = f.read().strip()

# Encode the reference audio into codec tokens for voice cloning
ref_codes = tts.encode_reference(ref_audio)

wav = tts.infer(input_text, ref_codes, ref_text)
sf.write("test.wav", wav, 24000)
```
## Preparing References for Cloning
NeuTTS Air requires two inputs:
1. **Reference audio** - a `.wav` sample of the voice to clone (e.g. `samples/dave.wav`)
2. **Reference text** - a transcript of that audio (e.g. `samples/dave.txt`)

For optimal performance, reference audio samples should be:

1. **Mono**
2. **16-44 kHz sample rate**
3. **3-15 seconds in length**
4. **Saved as a `.wav` file**
5. **Clean** — minimal to no background noise
6. **Natural, continuous speech** — like a monologue or conversation, with few pauses, so the model can capture tone effectively
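
To sanity-check a candidate reference against criteria 1-5 above, a small helper built on `soundfile` (already used in the examples) can inspect the file's metadata; `check_reference` is a hypothetical helper for illustration, not part of the library:

```python
import soundfile as sf

def check_reference(path: str) -> list[str]:
    """Return a list of problems with a candidate reference recording."""
    info = sf.info(path)  # reads metadata only; the audio is not fully loaded
    problems = []
    if not path.endswith(".wav"):
        problems.append("not a .wav file")
    if info.channels != 1:
        problems.append(f"expected mono, got {info.channels} channels")
    if not 16_000 <= info.samplerate <= 44_000:
        problems.append(f"sample rate {info.samplerate} Hz outside 16-44 kHz")
    if not 3.0 <= info.duration <= 15.0:
        problems.append(f"duration {info.duration:.1f}s outside 3-15 s")
    return problems

print(check_reference("samples/dave.wav") or "reference looks good")
```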
## Guidelines for Minimizing Latency
For optimal performance on-device:
1. Use the GGUF model backbones
2. Pre-encode references (see the sketch after this list)
3. Use the [onnx codec decoder](https://huggingface.co/neuphonic/neucodec-onnx-decoder)
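
Putting the three tips together, a minimal sketch, assuming the `NeuTTSAir` API from the usage example above and assuming the ONNX decoder repo can be passed via `codec_repo` the same way as the default codec (an assumption, not confirmed by this README):

```python
import torch
from neuttsair.neutts import NeuTTSAir

# Tip 1: quantised GGUF backbone; Tip 3: ONNX codec decoder.
# Assumption: the ONNX decoder repo is accepted via codec_repo like the default codec.
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air-q4-gguf",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec-onnx-decoder",
    codec_device="cpu",
)

# Tip 2: reuse codes produced once by examples.encode_reference,
# so the codec encoder is never needed at generation time.
ref_codes = torch.load("encoded_reference.pt")
with open("samples/dave.txt") as f:
    ref_text = f.read().strip()

wav = tts.infer("Hello from a low-latency pipeline.", ref_codes, ref_text)
```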
Take a look at the [examples README](examples/README.md#minimal-latency-example) to get started.
## Responsibility
Every audio file generated by NeuTTS Air is watermarked with the [Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth).
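
To confirm that a generated file carries the watermark, the watermarker can be run in extraction mode; a rough sketch, assuming the `PerthImplicitWatermarker` / `get_watermark` API described in the perth project README:

```python
import perth
import soundfile as sf

# Load a file generated by NeuTTS Air
wav, sample_rate = sf.read("test.wav")

# Assumption: class and method names as in the resemble-ai/perth README
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(wav, sample_rate=sample_rate)
print(f"Extracted watermark: {watermark}")
```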
# Examples
### GGUF Backbones
To run the model with `llama-cpp-python` in GGUF format, select a GGUF backbone when invoking the example script.
```bash
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio ./samples/dave.wav \
  --ref_text ./samples/dave.txt \
  --backbone neuphonic/neutts-air-q4-gguf
```
### Pre-encode a reference
Reference encoding can be done ahead of time to reduce latency at inference; to pre-encode a reference, you only need to provide the reference audio, as in the following script:
```bash
python -m examples.encode_reference \
  --ref_audio ./samples/dave.wav \
  --output_path encoded_reference.pt
```
### Minimal Latency Example
To take advantage of encoding references ahead of time, we have compiled the codec decoder into an [onnx graph](https://huggingface.co/neuphonic/neucodec-onnx-decoder) that enables running NeuTTS Air without loading the encoder.
This can be useful when running the model in resource-constrained environments, where the encoder would add significant extra latency and memory usage.
To test the decoder, make sure you have installed `onnxruntime` and run the following:
```bash
python -m examples.onnx_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_codes samples/dave.pt \
  --ref_text samples/dave.txt \
  --backbone neuphonic/neutts-air-q4-gguf
```