# Coqui Xtts > We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python`dataclasses`. Here is how a simple configuration looks --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/configuration.md # Configuration We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit. ```python from dataclasses import asdict, dataclass, field from typing import List, Union from coqpit.coqpit import MISSING, Coqpit, check_argument @dataclass class SimpleConfig(Coqpit): val_a: int = 10 val_b: int = None val_d: float = 10.21 val_c: str = "Coqpit is great!" vol_e: bool = True # mandatory field # raise an error when accessing the value if it is not changed. It is a way to define val_k: int = MISSING # optional field val_dict: dict = field(default_factory=lambda: {"val_aa": 10, "val_ss": "This is in a dict."}) # list of list val_listoflist: List[List] = field(default_factory=lambda: [[1, 2], [3, 4]]) val_listofunion: List[List[Union[str, int, bool]]] = field( default_factory=lambda: [[1, 3], [1, "Hi!"], [True, False]] ) def check_values( self, ): # you can define explicit constraints manually or by`check_argument()` """Check config fields""" c = asdict(self) # avoid unexpected changes on `self` check_argument("val_a", c, restricted=True, min_val=10, max_val=2056) check_argument("val_b", c, restricted=True, min_val=128, max_val=4058, allow_none=True) check_argument("val_c", c, restricted=True) ``` In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime. It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments. The general configuration hierarchy looks like below: ``` ModelConfig() | | -> ... # model specific configurations | -> ModelArgs() # model class arguments | -> BaseDatasetConfig() # only for tts models | -> BaseXModelConfig() # Generic fields for `tts` and `vocoder` models. | | -> BaseTrainingConfig() # trainer fields | -> BaseAudioConfig() # audio processing fields ``` In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model. We host pre-defined model configurations under ```TTS//configs/```. Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided. --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/contributing.md ```{include} ../../CONTRIBUTING.md :relative-images: ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/docker_images.md (docker_images)= ## Docker images We provide docker images to be able to test TTS without having to setup your own environment. ### Using premade images You can use premade images built automatically from the latest TTS version. #### CPU version ```bash docker pull ghcr.io/coqui-ai/tts-cpu ``` #### GPU version ```bash docker pull ghcr.io/coqui-ai/tts ``` ### Building your own image ```bash docker build -t tts . ``` ## Basic inference Basic usage: generating an audio file from a text passed as argument. 
You can pass any tts argument after the image name. ### CPU version ```bash docker run --rm -v ~/tts-output:/root/tts-output ghcr.io/coqui-ai/tts-cpu --text "Hello." --out_path /root/tts-output/hello.wav ``` ### GPU version For the GPU version, you need to have the latest NVIDIA drivers installed. With `nvidia-smi` you can check the CUDA version supported, it must be >= 11.8 ```bash docker run --rm --gpus all -v ~/tts-output:/root/tts-output ghcr.io/coqui-ai/tts --text "Hello." --out_path /root/tts-output/hello.wav --use_cuda true ``` ## Start a server Starting a TTS server: Start the container and get a shell inside it. ### CPU version ```bash docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu python3 TTS/server/server.py --list_models #To get the list of available models python3 TTS/server/server.py --model_name tts_models/en/vctk/vits ``` ### GPU version ```bash docker run --rm -it -p 5002:5002 --gpus all --entrypoint /bin/bash ghcr.io/coqui-ai/tts python3 TTS/server/server.py --list_models #To get the list of available models python3 TTS/server/server.py --model_name tts_models/en/vctk/vits --use_cuda true ``` Click [there](http://[::1]:5002/) and have fun with the server! --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/faq.md # Humble FAQ We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper. ## Errors with a pre-trained model. How can I resolve this? - Make sure you use the right commit version of 🐸TTS. Each pre-trained model has its corresponding version that needs to be used. It is defined on the model table. - If it is still problematic, post your problem on [Discussions](https://github.com/coqui-ai/TTS/discussions). Please give as many details as possible (error message, your TTS version, your TTS model and config.json etc.) - If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny. ## What are the requirements of a good 🐸TTS dataset? * {ref}`See this page ` ## How should I choose the right model? - First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2. - Tacotron models produce the most natural voice if your dataset is not too noisy. - If both models do not perform well and especially the attention does not align, then try AlignTTS or GlowTTS. - If you need faster models, consider SpeedySpeech, GlowTTS or AlignTTS. Keep in mind that SpeedySpeech requires a pre-trained Tacotron or Tacotron2 model to compute text-to-speech alignments. ## How can I train my own `tts` model? 0. Check your dataset with notebooks in [dataset_analysis](https://github.com/coqui-ai/TTS/tree/master/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/coqui-ai/TTS/blob/master/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in a better audio synthesis. 1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech. A `formatter` parses the metadata file and converts a list of training samples. 2. If you have a dataset with a different alphabet than English, you need to set your own character list in the ```config.json```. - If you use phonemes for training and your language is supported [here](https://github.com/rhasspy/gruut#supported-languages), you don't need to set your character list. 
- You can use `TTS/bin/find_unique_chars.py` to get characters used in your dataset. 3. Write your own text cleaner in ```utils.text.cleaners```. It is not always necessary, except when you have a different alphabet or language-specific requirements. - A `cleaner` performs number and abbreviation expansion and text normalization. Basically, it converts the written text to its spoken format. - If you go lazy, you can try using ```basic_cleaners```. 4. Fill in a ```config.json```. Go over each parameter one by one and consider it regarding the appended explanation. - Check the `Coqpit` class created for your target model. Coqpit classes for `tts` models are under `TTS/tts/configs/`. - You just need to define fields you need/want to change in your `config.json`. For the rest, their default values are used. - 'sample_rate', 'phoneme_language' (if phoneme enabled), 'output_path', 'datasets', 'text_cleaner' are the fields you need to edit in most of the cases. - Here is a sample `config.json` for training a `GlowTTS` network. ```json { "model": "glow_tts", "batch_size": 32, "eval_batch_size": 16, "num_loader_workers": 4, "num_eval_loader_workers": 4, "run_eval": true, "test_delay_epochs": -1, "epochs": 1000, "text_cleaner": "english_cleaners", "use_phonemes": false, "phoneme_language": "en-us", "phoneme_cache_path": "phoneme_cache", "print_step": 25, "print_eval": true, "mixed_precision": false, "output_path": "recipes/ljspeech/glow_tts/", "test_sentences": ["Test this sentence.", "This test sentence.", "Sentence this test."], "datasets":[{"formatter": "ljspeech", "meta_file_train":"metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}] } ``` 6. Train your model. - SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json``` - MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json``` **Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```. ## How can I train in a different language? - Check steps 2, 3, 4, 5 above. ## How can I train multi-GPUs? - Check step 5 above. ## How can I check model performance? - You can inspect model training and performance using ```tensorboard```. It will show you loss, attention alignment, model output. Go with the order below to measure the model performance. 1. Check ground truth spectrograms. If they do not look as they are supposed to, then check audio processing parameters in ```config.json```. 2. Check train and eval losses and make sure that they all decrease smoothly in time. 3. Check model spectrograms. Especially, training outputs should look similar to ground truth spectrograms after ~10K iterations. 4. Your model would not work well at test time until the attention has a near diagonal alignment. This is the sublime art of TTS training. - Attention should converge diagonally after ~50K iterations. - If attention does not converge, the probabilities are; - Your dataset is too noisy or small. - Samples are too long. - Batch size is too small (batch_size < 32 would be having a hard time converging) - You can also try other attention algorithms like 'graves', 'bidirectional_decoder', 'forward_attn'. - 'bidirectional_decoder' is your ultimate savior, but it trains 2x slower and demands 1.5x more GPU memory. - You can also try the other models like AlignTTS or GlowTTS. ## How do I know when to stop training? 
There is no single objective metric to decide when to stop training, since voice quality is a subjective matter. In our model trainings, we follow these steps:

- Check the test-time audio outputs and see whether they are still improving.
- Check the test-time attention maps and see whether they look clear and diagonal.
- Check the validation loss and see whether it has converged and gone down smoothly, or has started to go up and overfit.
- If the answer is YES for all of the above, then test the model with a set of complex sentences. For English, you can use the `TestAttention` notebook.

Keep in mind that the approach above only validates model robustness. It is hard to estimate voice quality without asking actual people. The best approach is to pick a set of promising models and run a Mean Opinion Score study, asking actual people to score the models.

## My model does not learn. How can I debug?

- Go over the steps under "How can I check model performance?"

## Attention does not align. How can I make it work?

- Check the 4th step under "How can I check model performance?"

## How can I test a trained model?

- The best way is to use the `tts` or `tts-server` commands. For details check {ref}`here <synthesizing_speech>`.
- If you need a custom inference setup, you can build it on top of the ```TTS.utils.synthesizer.Synthesizer``` class.

## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps'" - Stopnet does not work.

- In general, all of the above relates to the `stopnet`. It is the part of the model that tells the `decoder` when to stop.
- In general, a poor `stopnet` relates to something else that is broken in your model or dataset, especially the attention module.
- One common reason is silence at the beginning and the end of the audio clips. Check the ```trim_db``` value in the config. You can find a better value for your dataset by using the ```CheckSpectrogram``` notebook. If this value is too small, too much of the audio will be trimmed. If it is too big, too much silence will remain. Both will hurt `stopnet` performance.

---

# Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/finetuning.md

# Fine-tuning a 🐸 TTS model

## Fine-tuning

Fine-tuning takes a pre-trained model and retrains it to improve its performance on a different task or dataset. In 🐸TTS we provide different pre-trained models in different languages, each with different pros and cons. You can take one of them and fine-tune it for your own dataset. This will help you in two main ways:

1. Faster learning

    Since a pre-trained model has already learned features that are relevant for the task, it will converge faster on a new dataset. This reduces the cost of training and lets you experiment faster.

2. Better results with small datasets

    Deep learning models are data hungry and they give better performance with more data. However, it is not always possible to have this abundance, especially in specific domains. For instance, the LJSpeech dataset, which we released most of our English models with, is almost 24 hours long. It takes weeks to record this amount of data with the help of a voice actor.

    Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models, fine-tune it on your own speech dataset, and achieve reasonable results with only a couple of hours of data.

    However, note that fine-tuning does not guarantee great results. The model performance still depends on the {ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore, it still takes a bit of tinkering.

## Steps to fine-tune a 🐸 TTS model

1. Set up your dataset.
You need to format your target dataset in a certain way so that 🐸TTS data loader will be able to load it for the training. Please see {ref}`this page ` for more information about formatting. 2. Choose the model you want to fine-tune. You can list the available models in the command line with ```bash tts --list_models ``` The command above lists the models in a naming format as ```///```. Or you can manually check the `.model.json` file in the project directory. You should choose the model based on your requirements. Some models are fast and some are better in speech quality. One lazy way to test a model is running the model on the hardware you want to use and see how it works. For simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here `. 3. Download the model. You can download the model by using the `tts` command. If you run `tts` with a particular model, it will download it automatically and the model path will be printed on the terminal. ```bash tts --model_name tts_models/es/mai/tacotron2-DDC --text "Ola." > Downloading model to /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts ... ``` In the example above, we called the Spanish Tacotron model and give the sample output showing use the path where the model is downloaded. 4. Setup the model config for fine-tuning. You need to change certain fields in the model config. You have 3 options for playing with the configuration. 1. Edit the fields in the ```config.json``` file if you want to use ```TTS/bin/train_tts.py``` to train the model. 2. Edit the fields in one of the training scripts in the ```recipes``` directory if you want to use python. 3. Use the command-line arguments to override the fields like ```--coqpit.lr 0.00001``` to change the learning rate. Some of the important fields are as follows: - `datasets` field: This is set to the dataset you want to fine-tune the model on. - `run_name` field: This is the name of the run. This is used to name the output directory and the entry in the logging dashboard. - `output_path` field: This is the path where the fine-tuned model is saved. - `lr` field: You may need to use a smaller learning rate for fine-tuning to not lose the features learned by the pre-trained model with big update steps. - `audio` fields: Different datasets have different audio characteristics. You must check the current audio parameters and make sure that the values reflect your dataset. For instance, your dataset might have a different audio sampling rate. Apart from the parameters above, you should check the whole configuration file and make sure that the values are correct for your dataset and training. 5. Start fine-tuning. Whether you use one of the training scripts under ```recipes``` folder or the ```train_tts.py``` to start your training, you should use the ```--restore_path``` flag to specify the path to the pre-trained model. ```bash CUDA_VISIBLE_DEVICES="0" python recipes/ljspeech/glow_tts/train_glowtts.py \ --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth ``` ```bash CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py \ --config_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/config.json \ --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth ``` As stated above, you can also use command-line arguments to change the model configuration. 
```bash CUDA_VISIBLE_DEVICES="0" python recipes/ljspeech/glow_tts/train_glowtts.py \ --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth --coqpit.run_name "glow-tts-finetune" \ --coqpit.lr 0.00001 ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/formatting_your_dataset.md (formatting_your_dataset)= # Formatting Your Dataset For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs transcription. If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format. Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`. ``` /wavs | - audio1.wav | - audio2.wav | - audio3.wav ... ``` You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text. We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc. ``` # metadata.txt audio1|This is my sentence.|This is my sentence. audio2|1469 and 1470|fourteen sixty-nine and fourteen seventy audio3|It'll be $16 sir.|It'll be sixteen dollars sir. ... ``` *If you don't have normalized transcriptions, you can use the same transcription for both columns. If it's your case, we recommend to use normalization later in the pipeline, either in the text cleaner or in the phonemizer.* In the end, we have the following folder structure ``` /MyTTSDataset | | -> metadata.txt | -> /wavs | -> audio1.wav | -> audio2.wav | ... ``` The format above is taken from widely-used the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. You can also download and see the dataset. 🐸TTS already provides tooling for the LJSpeech. if you use the same format, you can start training your models right away. ## Dataset Quality Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English. For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset). ## Using Your Dataset in 🐸TTS After you collect and format your dataset, you need to check two things. Whether you need a `formatter` and a `text_cleaner`. The `formatter` loads the text file (created above) as a list and the `text_cleaner` performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format). If you use a different dataset format than the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`. If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`. 
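A custom `text_cleaner` is just a function that takes raw text and returns its spoken, normalized form. The sketch below is a hypothetical example under that assumption; the abbreviation table and function name are made up, and you would place the function next to the built-in cleaners referenced above (e.g. ```basic_cleaners```).

```python
import re

# Hypothetical custom cleaner, modeled on the built-in cleaners such as
# `basic_cleaners`. The abbreviation table and function name are examples only.
_ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "no.": "number"}


def my_dataset_cleaners(text: str) -> str:
    """Convert written text into its spoken form for a custom dataset."""
    text = text.lower()
    for abbreviation, expansion in _ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # collapse any whitespace left over from the substitutions
    return re.sub(r"\s+", " ", text).strip()


print(my_dataset_cleaners("Dr.  Smith lives at No. 42."))
# -> "doctor smith lives at number 42."
```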
What you get out of a `formatter` is a `List[Dict]` in the following format:

```
>>> formatter(metafile_path)
[
    {"audio_file":"audio1.wav", "text":"This is my sentence.", "speaker_name":"MyDataset", "language": "lang_code"},
    {"audio_file":"audio1.wav", "text":"This is maybe a sentence.", "speaker_name":"MyDataset", "language": "lang_code"},
    ...
]
```

Each entry is a dictionary that holds at least the `audio_file`, `text` and `speaker_name` fields (plus `language` for multi-lingual datasets). `speaker_name` is the dataset name for single-speaker datasets and it is mainly used in multi-speaker models to map the speaker of each sample. But for now, we only focus on single-speaker datasets.

The purpose of a `formatter` is to parse your manifest file and load the audio file paths and transcriptions. Then, the output is passed to the `Dataset`. It computes features from the audio signals, calls text normalization routines, and converts raw text to phonemes if needed.

## Loading your dataset

Load one of the datasets supported by 🐸TTS.

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples

# dataset config for one of the pre-defined datasets
dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", language="en-us", path="dataset-path"
)

# load training samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
```

Load a custom dataset with a custom formatter.

```python
import os

from TTS.tts.datasets import load_tts_samples


# custom formatter implementation
def formatter(root_path, manifest_file, **kwargs):  # pylint: disable=unused-argument
    """Assumes each line of the manifest is formatted as ```<filename>|<transcription>```"""
    txt_file = os.path.join(root_path, manifest_file)
    items = []
    speaker_name = "my_speaker"
    with open(txt_file, "r", encoding="utf-8") as ttf:
        for line in ttf:
            cols = line.split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0])
            text = cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path})
    return items


# load training samples; `dataset_config` should point to your custom dataset (see `BaseDatasetConfig` above)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, formatter=formatter)
```

See `TTS.tts.datasets.TTSDataset`, a generic `Dataset` implementation for the `tts` models.

See `TTS.vocoder.datasets.*` for different `Dataset` implementations for the `vocoder` models.

See `TTS.utils.audio.AudioProcessor`, which includes all the audio processing and feature extraction functions used in a `Dataset` implementation. Feel free to add things as you need.

---

# Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/index.md

```{include} ../../README.md
:relative-images:
```

----

# Documentation Content

```{eval-rst}
.. toctree::
    :maxdepth: 2
    :caption: Get started

    tutorial_for_nervous_beginners
    installation
    faq
    contributing

.. toctree::
    :maxdepth: 2
    :caption: Using 🐸TTS

    inference
    docker_images
    implementing_a_new_model
    implementing_a_new_language_frontend
    training_a_model
    finetuning
    configuration
    formatting_your_dataset
    what_makes_a_good_dataset
    tts_datasets
    marytts

.. toctree::
    :maxdepth: 2
    :caption: Main Classes

    main_classes/trainer_api
    main_classes/audio_processor
    main_classes/model_api
    main_classes/dataset
    main_classes/gan
    main_classes/speaker_manager

.. toctree::
    :maxdepth: 2
    :caption: `tts` Models

    models/glow_tts.md
    models/vits.md
    models/forward_tts.md
    models/tacotron1-2.md
    models/overflow.md
    models/tortoise.md
    models/bark.md
    models/xtts.md

..
toctree:: :maxdepth: 2 :caption: `vocoder` Models ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/inference.md (synthesizing_speech)= # Synthesizing Speech First, you need to install TTS. We recommend using PyPi. You need to call the command below: ```bash $ pip install TTS ``` After the installation, 2 terminal commands are available. 1. TTS Command Line Interface (CLI). - `tts` 2. Local Demo Server. - `tts-server` 3. In 🐍Python. - `from TTS.api import TTS` ## On the Commandline - `tts` ![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif) After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the release models under 🐸TTS. Listing released 🐸TTS models. ```bash tts --list_models ``` Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.) ```bash tts --text "Text for TTS" \ --model_name "///" \ --out_path folder/to/save/output.wav ``` Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. ```bash tts --text "Text for TTS" \ --model_name "tts_models///" \ --vocoder_name "vocoder_models///" \ --out_path folder/to/save/output.wav ``` Run your own TTS model (Using Griffin-Lim Vocoder) ```bash tts --text "Text for TTS" \ --model_path path/to/model.pth \ --config_path path/to/config.json \ --out_path folder/to/save/output.wav ``` Run your own TTS and Vocoder models ```bash tts --text "Text for TTS" \ --config_path path/to/config.json \ --model_path path/to/model.pth \ --out_path folder/to/save/output.wav \ --vocoder_path path/to/vocoder.pth \ --vocoder_config_path path/to/vocoder_config.json ``` Run a multi-speaker TTS model from the released models list. ```bash tts --model_name "tts_models///" --list_speaker_idxs # list the possible speaker IDs. tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "tts_models///" --speaker_idx "" ``` Run a released voice conversion model ```bash tts --model_name "voice_conversion///" --source_wav "my/source/speaker/audio.wav" --target_wav "my/target/speaker/audio.wav" --out_path folder/to/save/output.wav ``` **Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder. ## On the Demo Server - `tts-server` ![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif) You can boot up a demo 🐸TTS server to run an inference with your models. Note that the server is not optimized for performance but gives you an easy way to interact with the models. The demo server provides pretty much the same interface as the CLI command. ```bash tts-server -h # see the help tts-server --list_models # list the available models. ``` Run a TTS model, from the release models list, with its default vocoder. If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize speech. ```bash tts-server --model_name "///" ``` Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. 
```bash tts-server --model_name "///" \ --vocoder_name "///" ``` ## Python 🐸TTS API You can run a multi-speaker and multi-lingual model in Python as ```python import torch from TTS.api import TTS # Get device device = "cuda" if torch.cuda.is_available() else "cpu" # List available 🐸TTS models print(TTS().list_models()) # Init TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device) # Run TTS # ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language # Text to speech list of amplitude values as output wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en") # Text to speech to a file tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") ``` #### Here is an example for a single speaker model. ```python # Init TTS with the target model name tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False) # Run TTS tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH) ``` #### Example voice cloning with YourTTS in English, French and Portuguese: ```python tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda") tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr", file_path="output.wav") tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav") ``` #### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav` ```python tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda") tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav") ``` #### Example voice cloning by a single speaker TTS model combining with the voice conversion model. This way, you can clone voices by using any model in 🐸TTS. ```python tts = TTS("tts_models/de/thorsten/tacotron2-DDC") tts.tts_with_vc_to_file( "Wie sage ich auf Italienisch, dass ich dich liebe?", speaker_wav="target/speaker.wav", file_path="ouptut.wav" ) ``` #### Example text to speech using **Fairseq models in ~1100 languages** 🤯. For these models use the following name format: `tts_models//fairseq/vits`. You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). ```python from TTS.api import TTS api = TTS(model_name="tts_models/eng/fairseq/vits").to("cuda") api.tts_to_file("This is a test.", file_path="output.wav") # TTS with on the fly voice conversion api = TTS("tts_models/deu/fairseq/vits") api.tts_with_vc_to_file( "Wie sage ich auf Italienisch, dass ich dich liebe?", speaker_wav="target/speaker.wav", file_path="ouptut.wav" ) ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/installation.md # Installation 🐸TTS supports python >=3.7 <3.11.0 and tested on Ubuntu 18.10, 19.10, 20.10. ## Using `pip` `pip` is recommended if you want to use 🐸TTS only for inference. 
You can install from PyPI as follows: ```bash pip install TTS # from PyPI ``` Or install from Github: ```bash pip install git+https://github.com/coqui-ai/TTS # from Github ``` ## Installing From Source This is recommended for development and more control over 🐸TTS. ```bash git clone https://github.com/coqui-ai/TTS/ cd TTS make system-deps # only on Linux systems. make install ``` ## On Windows If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](https://stackoverflow.com/questions/66726331/ --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/main_classes/model_api.md # Model API Model API provides you a set of functions that easily make your model compatible with the `Trainer`, `Synthesizer` and `ModelZoo`. ## Base TTS Model ```{eval-rst} .. autoclass:: TTS.model.BaseTrainerModel :members: ``` ## Base tts Model ```{eval-rst} .. autoclass:: TTS.tts.models.base_tts.BaseTTS :members: ``` ## Base vocoder Model ```{eval-rst} .. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder :members: ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/main_classes/speaker_manager.md # Speaker Manager API The {class}`TTS.tts.utils.speakers.SpeakerManager` organize speaker related data and information for 🐸TTS models. It is especially useful for multi-speaker models. ## Speaker Manager ```{eval-rst} .. automodule:: TTS.tts.utils.speakers :members: ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/main_classes/trainer_api.md # Trainer API We made the trainer a separate project on https://github.com/coqui-ai/Trainer --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/models/xtts.md # ⓍTTS ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise, ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. There is no need for an excessive amount of training data that spans countless hours. This is the same model that powers [Coqui Studio](https://coqui.ai/), and [Coqui API](https://docs.coqui.ai/docs), however we apply a few tricks to make it faster and support streaming inference. ### Features - Voice cloning. - Cross-language voice cloning. - Multi-lingual speech generation. - 24khz sampling rate. - Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference)) - Fine-tuning support. (See [Training](#training)) ### Updates with v2 - Improved voice cloning. - Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime. - 2 new languages: Hungarian and Korean. - Across the board quality improvements. ### Code Current implementation only supports inference and GPT encoder training. ### Languages As of now, XTTS-v2 supports 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko). Stay tuned as we continue to add support for more languages. If you have any language requests, please feel free to reach out. ### License This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). ### Contact Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Twitter](https://twitter.com/coqui_ai). You can also mail us at info@coqui.ai. 
### Inference #### 🐸TTS Command line You can check all supported languages with the following command: ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --list_language_idx ``` You can check all Coqui available speakers with the following command: ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --list_speaker_idx ``` ##### Coqui speakers You can do inference using one of the available speakers using the following command: ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent." \ --speaker_idx "Ana Florence" \ --language_idx en \ --use_cuda true ``` ##### Clone a voice You can clone a speaker voice using a single or multiple references: ###### Single reference ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "Bugün okula gitmek istemiyorum." \ --speaker_wav /path/to/target/speaker.wav \ --language_idx tr \ --use_cuda true ``` ###### Multiple references ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "Bugün okula gitmek istemiyorum." \ --speaker_wav /path/to/target/speaker.wav /path/to/target/speaker_2.wav /path/to/target/speaker_3.wav \ --language_idx tr \ --use_cuda true ``` or for all wav files in a directory you can use: ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "Bugün okula gitmek istemiyorum." \ --speaker_wav /path/to/target/*.wav \ --language_idx tr \ --use_cuda true ``` #### 🐸TTS API ##### Clone a voice You can clone a speaker voice using a single or multiple references: ###### Single reference Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio. You can optionally disable sentence splitting for better coherence but more VRAM and possibly hitting models context length limit. ```python from TTS.api import TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True) # generate speech by cloning a voice using default settings tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", file_path="output.wav", speaker_wav=["/path/to/target/speaker.wav"], language="en", split_sentences=True ) ``` ###### Multiple references You can pass multiple audio files to the `speaker_wav` argument for better voice cloning. 
```python from TTS.api import TTS # using the default version set in 🐸TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True) # using a specific version # 👀 see the branch names for versions on https://huggingface.co/coqui/XTTS-v2/tree/main # ❗some versions might be incompatible with the API tts = TTS("xtts_v2.0.2", gpu=True) # getting the latest XTTS_v2 tts = TTS("xtts", gpu=True) # generate speech by cloning a voice using default settings tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", file_path="output.wav", speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"], language="en") ``` ##### Coqui speakers You can do inference using one of the available speakers using the following code: ```python from TTS.api import TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True) # generate speech by cloning a voice using default settings tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", file_path="output.wav", speaker="Ana Florence", language="en", split_sentences=True ) ``` #### 🐸TTS Model API To use the model API, you need to download the model files and pass config and model file paths manually. #### Manual Inference If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first. ```console pip install deepspeed==0.10.3 ``` ##### inference parameters - `text`: The text to be synthesized. - `language`: The language of the text to be synthesized. - `gpt_cond_latent`: The latent vector you get with get_conditioning_latents. (You can cache for faster inference with same speaker) - `speaker_embedding`: The speaker embedding you get with get_conditioning_latents. (You can cache for faster inference with same speaker) - `temperature`: The softmax temperature of the autoregressive model. Defaults to 0.65. - `length_penalty`: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs. Defaults to 1.0. - `repetition_penalty`: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. Defaults to 2.0. - `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50. - `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8. - `speed`: The speed rate of the generated audio. Defaults to 1.0. (can produce artifacts if far from 1.0) - `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to True. 
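To make the list above concrete, a call that overrides several of these parameters might look like the following sketch. It is not self-contained: it assumes `model`, `gpt_cond_latent` and `speaker_embedding` have already been prepared exactly as in the full example that follows, and the values are purely illustrative.

```python
# Sketch only: `model`, `gpt_cond_latent` and `speaker_embedding` are assumed
# to be loaded/computed as in the full example below.
out = model.inference(
    "A short test sentence.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,            # softmax temperature of the autoregressive model
    length_penalty=1.0,          # higher values push towards more terse outputs
    repetition_penalty=2.0,      # discourages repeats and long silences
    top_k=50,
    top_p=0.8,
    speed=1.0,                   # values far from 1.0 can produce artifacts
    enable_text_splitting=True,  # split the input into sentences
)
```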
##### Inference ```python import os import torch import torchaudio from TTS.tts.configs.xtts_config import XttsConfig from TTS.tts.models.xtts import Xtts print("Loading model...") config = XttsConfig() config.load_json("/path/to/xtts/config.json") model = Xtts.init_from_config(config) model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True) model.cuda() print("Computing speaker latents...") gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"]) print("Inference...") out = model.inference( "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.", "en", gpt_cond_latent, speaker_embedding, temperature=0.7, # Add custom parameters here ) torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000) ``` ##### Streaming manually Here the goal is to stream the audio as it is being generated. This is useful for real-time applications. Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster. ```python import os import time import torch import torchaudio from TTS.tts.configs.xtts_config import XttsConfig from TTS.tts.models.xtts import Xtts print("Loading model...") config = XttsConfig() config.load_json("/path/to/xtts/config.json") model = Xtts.init_from_config(config) model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True) model.cuda() print("Computing speaker latents...") gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"]) print("Inference...") t0 = time.time() chunks = model.inference_stream( "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.", "en", gpt_cond_latent, speaker_embedding ) wav_chuncks = [] for i, chunk in enumerate(chunks): if i == 0: print(f"Time to first chunck: {time.time() - t0}") print(f"Received chunk {i} of audio length {chunk.shape[-1]}") wav_chuncks.append(chunk) wav = torch.cat(wav_chuncks, dim=0) torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000) ``` ### Training #### Easy training To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio demo that implements the whole fine-tuning pipeline. The gradio demo enables the user to easily do the following steps: - Preprocessing of the uploaded audio or audio files in 🐸 TTS coqui formatter - Train the XTTS GPT encoder with the processed data - Inference support using the fine-tuned model The user can run this gradio demo locally or remotely using a Colab Notebook. ##### Run demo on Colab To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available we did a Google Colab Notebook. The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing). To learn how to use this Colab Notebook please check the [XTTS fine-tuning video](). If you are not able to acess the video you need to follow the steps: 1. Open the Colab notebook and start the demo by runining the first two cells (ignore pip install errors in the first one). 2. Click on the link "Running on public URL:" on the second cell output. 3. On the first Tab (1 - Data processing) you need to select the audio file or files, wait for upload, and then click on the button "Step 1 - Create dataset" and then wait until the dataset processing is done. 4. 
As soon as the dataset processing is done, you need to go to the second Tab (2 - Fine-tuning XTTS Encoder) and press the button "Step 2 - Run the training" and then wait until the training is finished. Note that it can take up to 40 minutes.
5. As soon as the training is done, you can go to the third Tab (3 - Inference) and click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".

##### Run demo locally

To run the demo locally you need to do the following steps:
1. Install 🐸 TTS following the instructions available [here](https://tts.readthedocs.io/en/dev/installation.html#installation).
2. Install the Gradio demo requirements with the command `python3 -m pip install -r TTS/demos/xtts_ft_demo/requirements.txt`.
3. Run the Gradio demo using the command `python3 TTS/demos/xtts_ft_demo/xtts_demo.py`.
4. Follow the steps presented in the [tutorial video](https://www.youtube.com/watch?v=8tpDiiouGxc&feature=youtu.be) to fine-tune and test the fine-tuned model.

If you are not able to access the video, here is what you need to do:
1. On the first Tab (1 - Data processing), select the audio file or files and wait for the upload to finish.
2. Click on the button "Step 1 - Create dataset" and then wait until the dataset processing is done.
3. Go to the second Tab (2 - Fine-tuning XTTS Encoder) and press the button "Step 2 - Run the training" and then wait until the training is finished. It will take some time.
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".

#### Advanced training

A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py

You need to change the fields of the `BaseDatasetConfig` to match your dataset and then update the `GPTArgs` and `GPTTrainerConfig` fields as you need. By default, it will use the same parameters that the XTTS v1.1 model was trained with. To speed up model convergence, it will also download the XTTS v1.1 checkpoint and load it by default.

After training, you can run inference with the code below.
```python import os import torch import torchaudio from TTS.tts.configs.xtts_config import XttsConfig from TTS.tts.models.xtts import Xtts # Add here the xtts_config path CONFIG_PATH = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT-October-23-2023_10+36AM-653f2e75/config.json" # Add here the vocab file that you have used to train the model TOKENIZER_PATH = "recipes/ljspeech/xtts_v1/run/training/XTTS_v2_original_model_files/vocab.json" # Add here the checkpoint that you want to do inference with XTTS_CHECKPOINT = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT/best_model.pth" # Add here the speaker reference SPEAKER_REFERENCE = "LjSpeech_reference.wav" # output wav path OUTPUT_WAV_PATH = "xtts-ft.wav" print("Loading model...") config = XttsConfig() config.load_json(CONFIG_PATH) model = Xtts.init_from_config(config) model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False) model.cuda() print("Computing speaker latents...") gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE]) print("Inference...") out = model.inference( "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.", "en", gpt_cond_latent, speaker_embedding, temperature=0.7, # Add custom parameters here ) torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000) ``` ## References and Acknowledgements - VallE: https://arxiv.org/abs/2301.02111 - Tortoise Repo: https://github.com/neonbjb/tortoise-tts - Faster implementation: https://github.com/152334H/tortoise-tts-fast - Univnet: https://arxiv.org/abs/2106.07889 - Latent Diffusion:https://arxiv.org/abs/2112.10752 - DALL-E: https://arxiv.org/abs/2102.12092 - Perceiver: https://arxiv.org/abs/2103.03206 ## XttsConfig ```{eval-rst} .. autoclass:: TTS.tts.configs.xtts_config.XttsConfig :members: ``` ## XttsArgs ```{eval-rst} .. autoclass:: TTS.tts.models.xtts.XttsArgs :members: ``` ## XTTS Model ```{eval-rst} .. autoclass:: TTS.tts.models.xtts.XTTS :members: ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/training_a_model.md # Training a Model 1. Decide the model you want to use. Each model has a different set of pros and cons that define the run-time efficiency and the voice quality. It is up to you to decide what model serves your needs. Other than referring to the papers, one easy way is to test the 🐸TTS community models and see how fast and good each of the models. Or you can start a discussion on our communication channels. 2. Understand the configuration, its fields and values. For instance, if you want to train a `Tacotron` model then see the `TacotronConfig` class and make sure you understand it. 3. Check the recipes. Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good start point for `Nervous Beginners`. A recipe for `GlowTTS` using `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`. ```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py ``` You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig` fields as you need. 4. Run the training. ```bash $ CUDA_VISIBLE_DEVICES="0" python train_glowtts.py ``` Notice that we set the GPU for the training by `CUDA_VISIBLE_DEVICES` environment variable. To see available GPUs on your system, you can use `nvidia-smi` command on the terminal. 
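If you prefer to check the visible GPUs from Python instead of `nvidia-smi`, a quick sketch using PyTorch (already installed as a 🐸TTS dependency) prints the devices the trainer will see:

```python
import torch

# List the CUDA devices visible to PyTorch (and therefore to the trainer).
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```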
If you like to run a multi-gpu training using DDP back-end, ```bash $ CUDA_VISIBLE_DEVICES="0, 1, 2" python -m trainer.distribute --script /train_glowtts.py ``` The example above runs a multi-gpu training using GPUs `0, 1, 2`. Beginning of a training log looks like this: ```console > Experiment folder: /your/output_path/-Juni-23-2021_02+52-78899209 > Using CUDA: True > Number of GPUs: 1 > Setting up Audio Processor... | > sample_rate:22050 | > resample:False | > num_mels:80 | > min_level_db:-100 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:20 | > fft_size:1024 | > power:1.5 | > preemphasis:0.0 | > griffin_lim_iters:60 | > signal_norm:True | > symmetric_norm:True | > mel_fmin:0 | > mel_fmax:None | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:4.0 | > clip_norm:True | > do_trim_silence:True | > trim_db:45 | > do_sound_norm:False | > stats_path:None | > base:10 | > hop_length:256 | > win_length:1024 | > Found 13100 files in /your/dataset/path/ljspeech/LJSpeech-1.1 > Using model: glow_tts > Model has 28356129 parameters > EPOCH: 0/1000 > DataLoader initialization | > Use phonemes: False | > Number of instances : 12969 | > Max length sequence: 187 | > Min length sequence: 5 | > Avg length sequence: 98.3403500655409 | > Num. instances discarded by max-min (max=500, min=3) seq limits: 0 | > Batch group size: 0. > TRAINING (2021-06-23 14:52:54) --> STEP: 0/405 -- GLOBAL_STEP: 0 | > loss: 2.34670 | > log_mle: 1.61872 | > loss_dur: 0.72798 | > align_error: 0.52744 | > current_lr: 2.5e-07 | > grad_norm: 5.036039352416992 | > step_time: 5.8815 | > loader_time: 0.0065 ... ``` 5. Run the Tensorboard. ```bash $ tensorboard --logdir= ``` 6. Monitor the training progress. On the terminal and Tensorboard, you can monitor the progress of your model. Also Tensorboard provides certain figures and sample outputs. Note that different models have different metrics, visuals and outputs. You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions that occur in a training. 7. Use your best model for inference. Use `tts` or `tts-server` commands for testing your models. ```bash $ tts --text "Text for TTS" \ --model_path path/to/checkpoint_x.pth \ --config_path path/to/config.json \ --out_path folder/to/save/output.wav ``` 8. Return to the step 1 and reiterate for training a `vocoder` model. In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models. # Multi-speaker Training Training a multi-speaker model is mostly the same as training a single-speaker model. You need to specify a couple of configuration parameters, initiate a `SpeakerManager` instance and pass it to the model. The configuration parameters define whether you want to train the model with a speaker-embedding layer or pre-computed d-vectors. For using d-vectors, you first need to compute the d-vectors using the `SpeakerEncoder`. The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below. 
```{literalinclude} ../../recipes/vctk/glow_tts/train_glow_tts.py ``` --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/tts_datasets.md # TTS Datasets Some of the known public datasets that we successfully applied 🐸TTS: - [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/) - [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) - [English - TWEB](https://www.kaggle.com/bryanpark/the-world-english-bible-speech-dataset) - [English - LibriTTS](https://openslr.org/60/) - [English - VCTK](https://datashare.ed.ac.uk/handle/10283/2950) - [Multilingual - M-AI-Labs](http://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) - [Spanish](https://drive.google.com/file/d/1Sm_zyBo67XHkiFhcRSQ4YaHPYM0slO_e/view?usp=sharing) - thx! @carlfm01 - [German - Thorsten OGVD](https://github.com/thorstenMueller/deep-learning-german-tts) - [Japanese - Kokoro](https://www.kaggle.com/kaiida/kokoro-speech-dataset-v11-small/version/1) - [Chinese](https://www.data-baker.com/data/index/source/) - [Ukrainian - LADA](https://github.com/egorsmkv/ukrainian-tts-datasets/tree/main/lada) Let us know if you use 🐸TTS on a different dataset. --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/tutorial_for_nervous_beginners.md # Tutorial For Nervous Beginners ## Installation User friendly installation. Recommended only for synthesizing voice. ```bash $ pip install TTS ``` Developer friendly installation. ```bash $ git clone https://github.com/coqui-ai/TTS $ cd TTS $ pip install -e . ``` ## Training a `tts` Model A breakdown of a simple script that trains a GlowTTS model on the LJspeech dataset. See the comments for more details. ### Pure Python Way 0. Download your dataset. In this example, we download and use the LJSpeech dataset. Set the download directory based on your preferences. ```bash $ python -c 'from TTS.utils.downloaders import download_ljspeech; download_ljspeech("../recipes/ljspeech/");' ``` 1. Define `train.py`. ```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py ``` 2. Run the script. ```bash CUDA_VISIBLE_DEVICES=0 python train.py ``` - Continue a previous run. ```bash CUDA_VISIBLE_DEVICES=0 python train.py --continue_path path/to/previous/run/folder/ ``` - Fine-tune a model. ```bash CUDA_VISIBLE_DEVICES=0 python train.py --restore_path path/to/model/checkpoint.pth ``` - Run multi-gpu training. ```bash CUDA_VISIBLE_DEVICES=0,1,2 python -m trainer.distribute --script train.py ``` ### CLI Way We still support running training from CLI like in the old days. The same training run can also be started as follows. 1. Define your `config.json` ```json { "run_name": "my_run", "model": "glow_tts", "batch_size": 32, "eval_batch_size": 16, "num_loader_workers": 4, "num_eval_loader_workers": 4, "run_eval": true, "test_delay_epochs": -1, "epochs": 1000, "text_cleaner": "english_cleaners", "use_phonemes": false, "phoneme_language": "en-us", "phoneme_cache_path": "phoneme_cache", "print_step": 25, "print_eval": true, "mixed_precision": false, "output_path": "recipes/ljspeech/glow_tts/", "datasets":[{"formatter": "ljspeech", "meta_file_train":"metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}] } ``` 2. Start training. ```bash $ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json ``` ## Training a `vocoder` Model ```{literalinclude} ../../recipes/ljspeech/hifigan/train_hifigan.py ``` ❗️ Note that you can also use ```train_vocoder.py``` as the ```tts``` models above. 
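Once your models are trained, you can also load the checkpoints from 🐍 Python instead of the CLI. The snippet below is a sketch that assumes the `TTS.api.TTS` constructor accepts `model_path` and `config_path` arguments; the paths are placeholders pointing at your training output folder.

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Sketch: replace the placeholder paths with your own checkpoint and config.
tts = TTS(model_path="path/to/best_model.pth",
          config_path="path/to/config.json",
          progress_bar=False).to(device)
tts.tts_to_file(text="Text for TTS", file_path="output.wav")
```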
## Synthesizing Speech You can run `tts` and synthesize speech directly on the terminal. ```bash $ tts -h # see the help $ tts --list_models # list the available models. ``` ![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif) You can call `tts-server` to start a local demo server that you can open it on your favorite web browser and 🗣️. ```bash $ tts-server -h # see the help $ tts-server --list_models # list the available models. ``` ![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif) --- # Source: https://github.com/coqui-ai/TTS/blob/dev/docs/source/what_makes_a_good_dataset.md (what_makes_a_good_dataset)= # What makes a good TTS dataset ## What Makes a Good Dataset * **Gaussian like distribution on clip and text lengths**. So plot the distribution of clip lengths and check if it covers enough short and long voice clips. * **Mistake free**. Remove any wrong or broken files. Check annotations, compare transcript and audio length. * **Noise free**. Background noise might lead your model to struggle, especially for a good alignment. Even if it learns the alignment, the final result is likely to be suboptimial. * **Compatible tone and pitch among voice clips**. For instance, if you are using audiobook recordings for your project, it might have impersonations for different characters in the book. These differences between samples downgrade the model performance. * **Good phoneme coverage**. Make sure that your dataset covers a good portion of the phonemes, di-phonemes, and in some languages tri-phonemes. * **Naturalness of recordings**. For your model WISIAIL (What it sees is all it learns). Therefore, your dataset should accommodate all the attributes you want to hear from your model. ## Preprocessing Dataset If you like to use a bespoken dataset, you might like to perform a couple of quality checks before training. 🐸TTS provides a couple of notebooks (CheckSpectrograms, AnalyzeDataset) to expedite this part for you. * **AnalyzeDataset** is for checking dataset distribution in terms of the clip and transcript lengths. It is good to find outlier instances (too long, short text but long voice clip, etc.)and remove them before training. Keep in mind that we like to have a good balance between long and short clips to prevent any bias in training. If you have only short clips (1-3 secs), then your model might suffer for long sentences and if your instances are long, then it might not learn the alignment or might take too long to train the model. * **CheckSpectrograms** is to measure the noise level of the clips and find good audio processing parameters. The noise level might be observed by checking spectrograms. If spectrograms look cluttered, especially in silent parts, this dataset might not be a good candidate for a TTS project. If your voice clips are too noisy in the background, it makes things harder for your model to learn the alignment, and the final result might be different than the voice you are given. If the spectrograms look good, then the next step is to find a good set of audio processing parameters, defined in ```config.json```. In the notebook, you can compare different sets of parameters and see the resynthesis results in relation to the given ground-truth. Find the best parameters that give the best possible synthesis performance. Another practical detail is the quantization level of the clips. If your dataset has a very high bit-rate, that might cause slow data-load time and consequently slow training. 
It is better to reduce the sample rate of your dataset to around 16000-22050 Hz.
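For example, you can resample a folder of clips offline with `librosa` and `soundfile` (both typically installed with 🐸TTS's dependencies). This is a sketch that assumes your clips sit in a `wavs/` folder and writes resampled copies next to it:

```python
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 22050  # target sample rate in Hz

src_dir, dst_dir = Path("wavs"), Path("wavs_22k")
dst_dir.mkdir(exist_ok=True)

for wav_path in src_dir.glob("*.wav"):
    # librosa resamples to TARGET_SR while loading
    audio, _ = librosa.load(wav_path, sr=TARGET_SR)
    sf.write(dst_dir / wav_path.name, audio, TARGET_SR)
```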