mirror of
https://github.com/ahmetoner/whisper-asr-webservice.git
synced 2023-04-14 03:48:29 +03:00
Update README.md
README.md
@@ -1 +1,93 @@

# Whisper Webservice

The webservice will be available soon.

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

## Docker Setup

The Docker image will be available soon.

## Setup

We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. The following command will pull and install the latest commit from the OpenAI Whisper repository, along with its Python dependencies:

    pip install git+https://github.com/openai/whisper.git

It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
```
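
Before running a transcription, it can help to confirm that `ffmpeg` is actually reachable on the `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of Whisper.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the PATH.

    Hypothetical helper for illustration; it only checks that the
    binary is discoverable, not that it works or which version it is.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; install it with your package manager")
```

The check is deliberately shallow: `shutil.which` only looks the name up on the `PATH`, so a broken installation would still pass.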

## Command-line usage

The following command will transcribe speech in audio files, using the `medium` model:

    whisper audio.flac audio.mp3 audio.wav --model medium

The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:

    whisper japanese.wav --language Japanese

Adding `--task translate` will translate the speech into English:

    whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

    whisper --help

See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) in the OpenAI Whisper repository for the list of all available languages.
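
The command-line options in this section can also be assembled programmatically, for example when calling the CLI from a script. The sketch below only builds the argument list from the flags documented above (`--model`, `--language`, `--task`); `build_whisper_command` is a hypothetical helper and not part of Whisper.

```python
from typing import List, Optional


def build_whisper_command(
    audio_files: List[str],
    model: Optional[str] = None,
    language: Optional[str] = None,
    task: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the `whisper` CLI (hypothetical helper).

    Options left as None are omitted, so the CLI falls back to its defaults.
    """
    if not audio_files:
        raise ValueError("at least one audio file is required")
    command = ["whisper", *audio_files]
    if model is not None:
        command += ["--model", model]
    if language is not None:
        command += ["--language", language]
    if task is not None:
        command += ["--task", task]
    return command


if __name__ == "__main__":
    # Mirrors the translation example from this section.
    print(build_whisper_command(["japanese.wav"], language="Japanese", task="translate"))
```

Assuming the `whisper` executable is installed, the resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.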

## Python usage

Transcription can also be performed within Python:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```

Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
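
To make the windowing concrete, the sketch below computes fixed, non-overlapping 30-second sample ranges for a recording. It is a simplification under stated assumptions: it assumes 16 kHz audio and fixed steps, whereas the actual `transcribe()` implementation may position its windows differently. `window_bounds` and the constants are hypothetical names used only for illustration.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000   # assumption: samples per second
WINDOW_SECONDS = 30    # window length described above


def window_bounds(
    n_samples: int,
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of consecutive fixed-size windows.

    The last window may be shorter than the others; in practice it would
    be padded to the full length before being passed to the model.
    """
    if sample_rate <= 0 or window_seconds <= 0:
        raise ValueError("sample_rate and window_seconds must be positive")
    window = sample_rate * window_seconds
    for start in range(0, n_samples, window):
        yield start, min(start + window, n_samples)


if __name__ == "__main__":
    # A 70-second recording yields two full windows and one 10-second remainder.
    for start, end in window_bounds(70 * SAMPLE_RATE):
        print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```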

Below is an example usage of `whisper.detect_language()` and `whisper.decode()` which provide lower-level access to the model.

```python
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
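
In the example above, `max(probs, key=probs.get)` picks the single most likely entry from `probs`, which is used as a dictionary mapping languages to probabilities. The sketch below ranks several candidates instead of taking only the top one, which can be useful when the detection is ambiguous. `top_languages` and the sample values are made up for illustration and are not part of Whisper.

```python
from typing import Dict, List, Tuple


def top_languages(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Made-up probabilities standing in for the output of detect_language().
    example_probs = {"en": 0.05, "ja": 0.90, "ko": 0.03, "zh": 0.02}
    for language, probability in top_languages(example_probs, k=2):
        print(f"{language}: {probability:.2f}")
```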

## License

The code and the model weights of Whisper are released under the MIT License. See [LICENSE](LICENSE) for further details.