# Coqui TTS

---

# Configuration

We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks with Coqpit.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Union
from coqpit.coqpit import MISSING, Coqpit, check_argument


@dataclass
class SimpleConfig(Coqpit):
    val_a: int = 10
    val_b: int = None
    val_d: float = 10.21
    val_c: str = "Coqpit is great!"
    val_e: bool = True
    # mandatory field
    # raise an error when accessing the value if it is not changed. This is how you define a mandatory field.
    val_k: int = MISSING
    # optional field
    val_dict: dict = field(default_factory=lambda: {"val_aa": 10, "val_ss": "This is in a dict."})
    # list of list
    val_listoflist: List[List] = field(default_factory=lambda: [[1, 2], [3, 4]])
    val_listofunion: List[List[Union[str, int, bool]]] = field(
        default_factory=lambda: [[1, 3], [1, "Hi!"], [True, False]]
    )

    def check_values(
        self,
    ):  # you can define explicit constraints manually or by `check_argument()`
        """Check config fields"""
        c = asdict(self)  # avoid unexpected changes on `self`
        check_argument("val_a", c, restricted=True, min_val=10, max_val=2056)
        check_argument("val_b", c, restricted=True, min_val=128, max_val=4058, allow_none=True)
        check_argument("val_c", c, restricted=True)
```

In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime. It defines the model architecture, hyper-parameters, and training and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice, but it enables easier bookkeeping and reproducible experiments. The general configuration hierarchy looks like below:

```
ModelConfig()
     |
     | -> ...                  # model specific configurations
     | -> ModelArgs()          # model class arguments
     | -> BaseDatasetConfig()  # only for tts models
     | -> BaseXModelConfig()   # Generic fields for `tts` and `vocoder` models.
                |
                | -> BaseTrainingConfig()  # trainer fields
                | -> BaseAudioConfig()     # audio processing fields
```

In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model. We host pre-defined model configurations under ```TTS//configs/```. Although we recommend a unified config class, you can decompose it as you like for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.

---

```{include} ../../CONTRIBUTING.md
:relative-images:
```

---

(docker_images)=
## Docker images

We provide docker images to be able to test TTS without having to set up your own environment.

### Using premade images

You can use premade images built automatically from the latest TTS version.

#### CPU version

```bash
docker pull ghcr.io/coqui-ai/tts-cpu
```

#### GPU version

```bash
docker pull ghcr.io/coqui-ai/tts
```

### Building your own image

```bash
docker build -t tts .
```

## Basic inference

Basic usage: generating an audio file from a text passed as argument. You can pass any `tts` argument after the image name.

### CPU version

```bash
docker run --rm -v ~/tts-output:/root/tts-output ghcr.io/coqui-ai/tts-cpu --text "Hello."
--out_path /root/tts-output/hello.wav ``` ### GPU version For the GPU version, you need to have the latest NVIDIA drivers installed. With `nvidia-smi` you can check the CUDA version supported, it must be >= 11.8 ```bash docker run --rm --gpus all -v ~/tts-output:/root/tts-output ghcr.io/coqui-ai/tts --text "Hello." --out_path /root/tts-output/hello.wav --use_cuda true ``` ## Start a server Starting a TTS server: Start the container and get a shell inside it. ### CPU version ```bash docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu python3 TTS/server/server.py --list_models #To get the list of available models python3 TTS/server/server.py --model_name tts_models/en/vctk/vits ``` ### GPU version ```bash docker run --rm -it -p 5002:5002 --gpus all --entrypoint /bin/bash ghcr.io/coqui-ai/tts python3 TTS/server/server.py --list_models #To get the list of available models python3 TTS/server/server.py --model_name tts_models/en/vctk/vits --use_cuda true ``` Click [there](http://[::1]:5002/) and have fun with the server! --- # Humble FAQ We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper. ## Errors with a pre-trained model. How can I resolve this? - Make sure you use the right commit version of 🐸TTS. Each pre-trained model has its corresponding version that needs to be used. It is defined on the model table. - If it is still problematic, post your problem on [Discussions](https://github.com/coqui-ai/TTS/discussions). Please give as many details as possible (error message, your TTS version, your TTS model and config.json etc.) - If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny. ## What are the requirements of a good 🐸TTS dataset? * {ref}`See this page ` ## How should I choose the right model? - First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2. - Tacotron models produce the most natural voice if your dataset is not too noisy. - If both models do not perform well and especially the attention does not align, then try AlignTTS or GlowTTS. - If you need faster models, consider SpeedySpeech, GlowTTS or AlignTTS. Keep in mind that SpeedySpeech requires a pre-trained Tacotron or Tacotron2 model to compute text-to-speech alignments. ## How can I train my own `tts` model? 0. Check your dataset with notebooks in [dataset_analysis](https://github.com/coqui-ai/TTS/tree/master/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/coqui-ai/TTS/blob/master/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in a better audio synthesis. 1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech. A `formatter` parses the metadata file and converts a list of training samples. 2. If you have a dataset with a different alphabet than English, you need to set your own character list in the ```config.json```. - If you use phonemes for training and your language is supported [here](https://github.com/rhasspy/gruut#supported-languages), you don't need to set your character list. - You can use `TTS/bin/find_unique_chars.py` to get characters used in your dataset. 3. Write your own text cleaner in ```utils.text.cleaners```. It is not always necessary, except when you have a different alphabet or language-specific requirements. 
- A `cleaner` performs number and abbreviation expansion and text normalization. Basically, it converts the written text to its spoken format. - If you go lazy, you can try using ```basic_cleaners```. 4. Fill in a ```config.json```. Go over each parameter one by one and consider it regarding the appended explanation. - Check the `Coqpit` class created for your target model. Coqpit classes for `tts` models are under `TTS/tts/configs/`. - You just need to define fields you need/want to change in your `config.json`. For the rest, their default values are used. - 'sample_rate', 'phoneme_language' (if phoneme enabled), 'output_path', 'datasets', 'text_cleaner' are the fields you need to edit in most of the cases. - Here is a sample `config.json` for training a `GlowTTS` network. ```json { "model": "glow_tts", "batch_size": 32, "eval_batch_size": 16, "num_loader_workers": 4, "num_eval_loader_workers": 4, "run_eval": true, "test_delay_epochs": -1, "epochs": 1000, "text_cleaner": "english_cleaners", "use_phonemes": false, "phoneme_language": "en-us", "phoneme_cache_path": "phoneme_cache", "print_step": 25, "print_eval": true, "mixed_precision": false, "output_path": "recipes/ljspeech/glow_tts/", "test_sentences": ["Test this sentence.", "This test sentence.", "Sentence this test."], "datasets":[{"formatter": "ljspeech", "meta_file_train":"metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}] } ``` 6. Train your model. - SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json``` - MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json``` **Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```. ## How can I train in a different language? - Check steps 2, 3, 4, 5 above. ## How can I train multi-GPUs? - Check step 5 above. ## How can I check model performance? - You can inspect model training and performance using ```tensorboard```. It will show you loss, attention alignment, model output. Go with the order below to measure the model performance. 1. Check ground truth spectrograms. If they do not look as they are supposed to, then check audio processing parameters in ```config.json```. 2. Check train and eval losses and make sure that they all decrease smoothly in time. 3. Check model spectrograms. Especially, training outputs should look similar to ground truth spectrograms after ~10K iterations. 4. Your model would not work well at test time until the attention has a near diagonal alignment. This is the sublime art of TTS training. - Attention should converge diagonally after ~50K iterations. - If attention does not converge, the probabilities are; - Your dataset is too noisy or small. - Samples are too long. - Batch size is too small (batch_size < 32 would be having a hard time converging) - You can also try other attention algorithms like 'graves', 'bidirectional_decoder', 'forward_attn'. - 'bidirectional_decoder' is your ultimate savior, but it trains 2x slower and demands 1.5x more GPU memory. - You can also try the other models like AlignTTS or GlowTTS. ## How do I know when to stop training? There is no single objective metric to decide the end of a training since the voice quality is a subjective matter. In our model trainings, we follow these steps; - Check test time audio outputs, if it does not improve more. - Check test time attention maps, if they look clear and diagonal. 
- Check validation loss, if it converged and smoothly went down or started to overfit going up. - If the answer is YES for all of the above, then test the model with a set of complex sentences. For English, you can use the `TestAttention` notebook. Keep in mind that the approach above only validates the model robustness. It is hard to estimate the voice quality without asking the actual people. The best approach is to pick a set of promising models and run a Mean-Opinion-Score study asking actual people to score the models. ## My model does not learn. How can I debug? - Go over the steps under "How can I check model performance?" ## Attention does not align. How can I make it work? - Check the 4th step under "How can I check model performance?" ## How can I test a trained model? - The best way is to use `tts` or `tts-server` commands. For details check {ref}`here `. - If you need to code your own ```TTS.utils.synthesizer.Synthesizer``` class. ## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work. - In general, all of the above relates to the `stopnet`. It is the part of the model telling the `decoder` when to stop. - In general, a poor `stopnet` relates to something else that is broken in your model or dataset. Especially the attention module. - One common reason is the silent parts in the audio clips at the beginning and the ending. Check ```trim_db``` value in the config. You can find a better value for your dataset by using ```CheckSpectrogram``` notebook. If this value is too small, too much of the audio will be trimmed. If too big, then too much silence will remain. Both will curtail the `stopnet` performance. --- # Fine-tuning a 🐸 TTS model ## Fine-tuning Fine-tuning takes a pre-trained model and retrains it to improve the model performance on a different task or dataset. In 🐸TTS we provide different pre-trained models in different languages and different pros and cons. You can take one of them and fine-tune it for your own dataset. This will help you in two main ways: 1. Faster learning Since a pre-trained model has already learned features that are relevant for the task, it will converge faster on a new dataset. This will reduce the cost of training and let you experiment faster. 2. Better results with small datasets Deep learning models are data hungry and they give better performance with more data. However, it is not always possible to have this abundance, especially in specific domains. For instance, the LJSpeech dataset, that we released most of our English models with, is almost 24 hours long. It takes weeks to record this amount of data with the help of a voice actor. Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own speech dataset and achieve reasonable results with only a couple of hours of data. However, note that, fine-tuning does not ensure great results. The model performance still depends on the {ref}`dataset quality ` and the hyper-parameters you choose for fine-tuning. Therefore, it still takes a bit of tinkering. ## Steps to fine-tune a 🐸 TTS model 1. Setup your dataset. You need to format your target dataset in a certain way so that 🐸TTS data loader will be able to load it for the training. Please see {ref}`this page ` for more information about formatting. 2. Choose the model you want to fine-tune. 
    You can list the available models in the command line with

    ```bash
    tts --list_models
    ```

    The command above lists the models in the ```///``` naming format. Or you can manually check the `.models.json` file in the project directory.

    You should choose the model based on your requirements. Some models are fast and some are better in speech quality. A quick way to test a model is to run it on the hardware you plan to use and see how it performs. For simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here `.

3. Download the model.

    You can download the model by using the `tts` command. If you run `tts` with a particular model, it will download the model automatically and print the model path on the terminal.

    ```bash
    tts --model_name tts_models/es/mai/tacotron2-DDC --text "Ola."

    > Downloading model to /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts
    ...
    ```

    In the example above, we called the Spanish Tacotron model, and the sample output shows the path where the model was downloaded.

4. Set up the model config for fine-tuning.

    You need to change certain fields in the model config. You have three options for setting the configuration:

    1. Edit the fields in the ```config.json``` file if you want to use ```TTS/bin/train_tts.py``` to train the model.
    2. Edit the fields in one of the training scripts in the ```recipes``` directory if you want to use Python.
    3. Use command-line arguments to override the fields, like ```--coqpit.lr 0.00001``` to change the learning rate.

    Some of the important fields are as follows:

    - `datasets` field: This is set to the dataset you want to fine-tune the model on.
    - `run_name` field: This is the name of the run. It is used to name the output directory and the entry in the logging dashboard.
    - `output_path` field: This is the path where the fine-tuned model is saved.
    - `lr` field: You may need to use a smaller learning rate for fine-tuning so that large update steps do not wipe out the features learned by the pre-trained model.
    - `audio` fields: Different datasets have different audio characteristics. You must check the current audio parameters and make sure that the values reflect your dataset. For instance, your dataset might have a different audio sampling rate.

    Apart from the parameters above, you should check the whole configuration file and make sure that the values are correct for your dataset and training.

5. Start fine-tuning.

    Whether you use one of the training scripts under the ```recipes``` folder or ```TTS/bin/train_tts.py``` to start your training, you should use the ```--restore_path``` flag to specify the path to the pre-trained model.

    ```bash
    CUDA_VISIBLE_DEVICES="0" python recipes/ljspeech/glow_tts/train_glowtts.py \
        --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth
    ```

    ```bash
    CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py \
        --config_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/config.json \
        --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth
    ```

    As stated above, you can also use command-line arguments to change the model configuration.
    ```bash
    CUDA_VISIBLE_DEVICES="0" python recipes/ljspeech/glow_tts/train_glowtts.py \
        --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth \
        --coqpit.run_name "glow-tts-finetune" \
        --coqpit.lr 0.00001
    ```

---

(formatting_your_dataset)=
# Formatting Your Dataset

For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips, and each clip needs a transcription.

If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is open-source, free audio editing software.

It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using the `wav` file format.

Let's assume you created the audio clips and their transcriptions. You can collect all your clips in a folder. Let's call this folder `wavs`.

```
/wavs
  | - audio1.wav
  | - audio2.wav
  | - audio3.wav
  ...
```

You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. Make sure that the delimiter is not used in the transcription text.

We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.

```
# metadata.txt

audio1|This is my sentence.|This is my sentence.
audio2|1469 and 1470|fourteen sixty-nine and fourteen seventy
audio3|It'll be $16 sir.|It'll be sixteen dollars sir.
...
```

*If you don't have normalized transcriptions, you can use the same transcription for both columns. In that case, we recommend applying normalization later in the pipeline, either in the text cleaner or in the phonemizer.*

In the end, we have the following folder structure:

```
/MyTTSDataset
 |
 | -> metadata.txt
 | -> /wavs
   | -> audio1.wav
   | -> audio2.wav
   | ...
```

The format above is taken from the widely-used [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. You can also download it and take a look. 🐸TTS already provides tooling for LJSpeech. If you use the same format, you can start training your models right away.

## Dataset Quality

Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is especially important for non-phonemic languages like English. For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).

## Using Your Dataset in 🐸TTS

After you collect and format your dataset, you need to check two things: whether you need a `formatter` and whether you need a `text_cleaner`. The `formatter` loads the text file (created above) as a list, and the `text_cleaner` performs a sequence of text normalization operations that convert the raw text into the spoken representation (e.g. converting numbers to text, and acronyms and symbols to the spoken format).

If you use a different dataset format than LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`. If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
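As a rough illustration, a custom `text_cleaner` is just a function that maps raw text to its normalized, spoken form. The sketch below is self-contained; the function name and the tiny abbreviation table are hypothetical, and the built-in cleaners live in `TTS.tts.utils.text.cleaners`.

```python
import re

# Hypothetical abbreviation table, for illustration only.
_ABBREVIATIONS = {"mr.": "mister", "dr.": "doctor", "no.": "number"}


def my_custom_cleaner(text: str) -> str:
    """Lowercase the text, expand a few abbreviations and collapse whitespace."""
    text = text.lower()
    for abbreviation, expansion in _ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\s+", " ", text).strip()
```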
What you get out of a `formatter` is a `List[Dict]` in the following format.

```
>>> formatter(metafile_path)
[
    {"audio_file": "audio1.wav", "text": "This is my sentence.", "speaker_name": "MyDataset", "language": "lang_code"},
    {"audio_file": "audio1.wav", "text": "This is maybe a sentence.", "speaker_name": "MyDataset", "language": "lang_code"},
    ...
]
```

Each entry is parsed as ```{"audio_file": "<filename>", "text": "<transcription>", "speaker_name": "<speaker_name>", "language": "<language_code>"}```. `speaker_name` is the dataset name for single-speaker datasets, and it is mainly used in multi-speaker models to map the speaker of each sample. But for now, we only focus on single-speaker datasets.

The purpose of a `formatter` is to parse your manifest file and load the audio file paths and transcriptions. Then, the output is passed to the `Dataset`. It computes features from the audio signals, calls text normalization routines, and converts raw text to phonemes if needed.

## Loading your dataset

Load one of the datasets supported by 🐸TTS.

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples

# dataset config for one of the pre-defined datasets
dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", language="en-us", path="dataset-path"
)

# load training samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
```

Load a custom dataset with a custom formatter.

```python
import os

from TTS.tts.datasets import load_tts_samples


# custom formatter implementation
def formatter(root_path, manifest_file, **kwargs):  # pylint: disable=unused-argument
    """Assumes each line as ```<filename>|<transcription>```"""
    txt_file = os.path.join(root_path, manifest_file)
    items = []
    speaker_name = "my_speaker"
    with open(txt_file, "r", encoding="utf-8") as ttf:
        for line in ttf:
            cols = line.split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0])
            text = cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path})
    return items


# load training samples (`dataset_config` as defined in the previous example)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, formatter=formatter)
```

See `TTS.tts.datasets.TTSDataset`, a generic `Dataset` implementation for the `tts` models.

See `TTS.vocoder.datasets.*` for different `Dataset` implementations for the `vocoder` models.

See `TTS.utils.audio.AudioProcessor`, which includes all the audio processing and feature extraction functions used in a `Dataset` implementation. Feel free to add things as you need.

---

# Implementing a New Language Frontend

- Language frontends are located under `TTS.tts.utils.text`.
- Each special language has a separate folder.
- Each folder contains all the utilities for processing the text input.
- `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities from the previous step and is used to convert the text to phonemes or graphemes for the model.
- After you implement your phonemizer, you need to add it to `TTS/tts/utils/text/phonemizers/__init__.py` to be able to map the language code in the model config - `config.phoneme_language` - to the phonemizer class and initiate the phonemizer automatically.
- You should also add tests to `tests/text_tests` if you want to make a PR.

We suggest you check the available implementations as a reference. Good luck!

---

# Implementing a Model

1. Implement layers.

    You can either implement the layers under `TTS/tts/layers/new_model.py` or in the model file `TTS/tts/models/new_model.py`. You can also reuse layers already implemented.

2. Test layers.

    We keep tests under the `tests` folder.
You can add `tts` layers tests under `tts_tests` folder. Basic tests are checking input-output tensor shapes and output values for a given input. Consider testing extreme cases that are more likely to cause problems like `zero` tensors. 3. Implement a loss function. We keep loss functions under `TTS/tts/layers/losses.py`. You can also mix-and-match implemented loss functions as you like. A loss function returns a dictionary in a format ```{’loss’: loss, ‘loss1’:loss1 ...}``` and the dictionary must at least define the `loss` key which is the actual value used by the optimizer. All the items in the dictionary are automatically logged on the terminal and the Tensorboard. 4. Test the loss function. As we do for the layers, you need to test the loss functions too. You need to check input/output tensor shapes, expected output values for a given input tensor. For instance, certain loss functions have upper and lower limits and it is a wise practice to test with the inputs that should produce these limits. 5. Implement `MyModel`. In 🐸TTS, a model class is a self-sufficient implementation of a model directing all the interactions with the other components. It is enough to implement the API provided by the `BaseModel` class to comply. A model interacts with the `Trainer API` for training, `Synthesizer API` for inference and testing. A 🐸TTS model must return a dictionary by the `forward()` and `inference()` functions. This dictionary must `model_outputs` key that is considered as the main model output by the `Trainer` and `Synthesizer`. You can place your `tts` model implementation under `TTS/tts/models/new_model.py` then inherit and implement the `BaseTTS`. There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. Callbacks give you an infinite flexibility to add custom behaviours for your model and training routines. For more details, see {ref}`BaseTTS ` and :obj:`TTS.utils.callbacks`. 6. Optionally, define `MyModelArgs`. `MyModelArgs` is a 👨‍✈️Coqpit class that sets all the class arguments of the `MyModel`. `MyModelArgs` must have all the fields necessary to instantiate the `MyModel`. However, for training, you need to pass `MyModelConfig` to the model. 7. Test `MyModel`. As the layers and the loss functions, it is recommended to test your model. One smart way for testing is that you create two models with the exact same weights. Then we run a training loop with one of these models and compare the weights with the other model. All the weights need to be different in a passing test. Otherwise, it is likely that a part of the model is malfunctioning or not even attached to the model's computational graph. 8. Define `MyModelConfig`. Place `MyModelConfig` file under `TTS/models/configs`. It is enough to inherit the `BaseTTSConfig` to make your config compatible with the `Trainer`. You should also include `MyModelArgs` as a field if defined. The rest of the fields should define the model specific values and parameters. 9. Write Docstrings. We love you more when you document your code. ❤️ # Template 🐸TTS Model implementation You can start implementing your model by copying the following base class. 
```python
from typing import Dict, List, Tuple, Union

import torch
from torch import nn
from coqpit import Coqpit

from TTS.tts.models.base_tts import BaseTTS


class MyModel(BaseTTS):
    """
    Notes on input/output tensor shapes:
        Any input or output tensor of the model must be shaped as

        - 3D tensors `batch x time x channels`
        - 2D tensors `batch x channels`
        - 1D tensors `batch x 1`
    """

    def __init__(self, config: Coqpit):
        super().__init__()
        self._set_model_args(config)

    def _set_model_args(self, config: Coqpit):
        """Set model arguments from the config. Override this."""
        pass

    def forward(self, input: torch.Tensor, *args, aux_input={}, **kwargs) -> Dict:
        """Forward pass for the model mainly used in training.

        You can be flexible here and use a different number of arguments and argument names since it is intended to be
        used by `train_step()` without exposing it out of the model.

        Args:
            input (torch.Tensor): Input tensor.
            aux_input (Dict): Auxiliary model inputs like embeddings, durations or any other sorts of inputs.

        Returns:
            Dict: Model outputs. Main model output must be named as "model_outputs".
        """
        outputs_dict = {"model_outputs": None}
        ...
        return outputs_dict

    def inference(self, input: torch.Tensor, aux_input={}) -> Dict:
        """Forward pass for inference.

        We don't use `**kwargs` since it is problematic with the TorchScript API.

        Args:
            input (torch.Tensor): [description]
            aux_input (Dict): Auxiliary inputs like speaker embeddings, durations etc.

        Returns:
            Dict: [description]
        """
        outputs_dict = {"model_outputs": None}
        ...
        return outputs_dict

    def train_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
        """Perform a single training step. Run the model forward pass and compute losses.

        Args:
            batch (Dict): Input tensors.
            criterion (nn.Module): Loss layer designed for the model.

        Returns:
            Tuple[Dict, Dict]: Model outputs and computed losses.
        """
        outputs_dict = {}
        loss_dict = {}  # this returns from the criterion
        ...
        return outputs_dict, loss_dict

    def train_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets: Dict, steps: int) -> None:
        """Create visualizations and waveform examples for training.

        For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to be
        projected onto Tensorboard.

        Args:
            batch (Dict): Model inputs used at the previous training step.
            outputs (Dict): Model outputs generated at the previous training step.
            logger (Logger): Logger instance used for plotting.
            assets (Dict): Training assets (e.g. the audio processor) provided by the `Trainer`.
            steps (int): Current training step.
        """
        pass

    def eval_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
        """Perform a single evaluation step. Run the model forward pass and compute losses.

        In most cases, you can call `train_step()` with no changes.

        Args:
            batch (Dict): Input tensors.
            criterion (nn.Module): Loss layer designed for the model.

        Returns:
            Tuple[Dict, Dict]: Model outputs and computed losses.
        """
        outputs_dict = {}
        loss_dict = {}  # this returns from the criterion
        ...
        return outputs_dict, loss_dict

    def eval_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets: Dict, steps: int) -> None:
        """The same as `train_log()`"""
        pass

    def load_checkpoint(self, config: Coqpit, checkpoint_path: str, eval: bool = False) -> None:
        """Load a checkpoint and get ready for training or inference.

        Args:
            config (Coqpit): Model configuration.
            checkpoint_path (str): Path to the model checkpoint file.
            eval (bool, optional): If true, init model for inference else for training. Defaults to False.
        """
        ...
def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]: """Setup a return optimizer or optimizers.""" pass def get_lr(self) -> Union[float, List[float]]: """Return learning rate(s). Returns: Union[float, List[float]]: Model's initial learning rates. """ pass def get_scheduler(self, optimizer: torch.optim.Optimizer): pass def get_criterion(self): pass def format_batch(self): pass ``` --- ```{include} ../../README.md :relative-images: ``` ---- # Documentation Content ```{eval-rst} .. toctree:: :maxdepth: 2 :caption: Get started tutorial_for_nervous_beginners installation faq contributing .. toctree:: :maxdepth: 2 :caption: Using 🐸TTS inference docker_images implementing_a_new_model implementing_a_new_language_frontend training_a_model finetuning configuration formatting_your_dataset what_makes_a_good_dataset tts_datasets marytts .. toctree:: :maxdepth: 2 :caption: Main Classes main_classes/trainer_api main_classes/audio_processor main_classes/model_api main_classes/dataset main_classes/gan main_classes/speaker_manager .. toctree:: :maxdepth: 2 :caption: `tts` Models models/glow_tts.md models/vits.md models/forward_tts.md models/tacotron1-2.md models/overflow.md models/tortoise.md models/bark.md models/xtts.md .. toctree:: :maxdepth: 2 :caption: `vocoder` Models ``` --- (synthesizing_speech)= # Synthesizing Speech First, you need to install TTS. We recommend using PyPi. You need to call the command below: ```bash $ pip install TTS ``` After the installation, 2 terminal commands are available. 1. TTS Command Line Interface (CLI). - `tts` 2. Local Demo Server. - `tts-server` 3. In 🐍Python. - `from TTS.api import TTS` ## On the Commandline - `tts` ![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif) After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the release models under 🐸TTS. Listing released 🐸TTS models. ```bash tts --list_models ``` Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.) ```bash tts --text "Text for TTS" \ --model_name "///" \ --out_path folder/to/save/output.wav ``` Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. ```bash tts --text "Text for TTS" \ --model_name "tts_models///" \ --vocoder_name "vocoder_models///" \ --out_path folder/to/save/output.wav ``` Run your own TTS model (Using Griffin-Lim Vocoder) ```bash tts --text "Text for TTS" \ --model_path path/to/model.pth \ --config_path path/to/config.json \ --out_path folder/to/save/output.wav ``` Run your own TTS and Vocoder models ```bash tts --text "Text for TTS" \ --config_path path/to/config.json \ --model_path path/to/model.pth \ --out_path folder/to/save/output.wav \ --vocoder_path path/to/vocoder.pth \ --vocoder_config_path path/to/vocoder_config.json ``` Run a multi-speaker TTS model from the released models list. ```bash tts --model_name "tts_models///" --list_speaker_idxs # list the possible speaker IDs. tts --text "Text for TTS." 
--out_path output/path/speech.wav --model_name "tts_models///" --speaker_idx "" ``` Run a released voice conversion model ```bash tts --model_name "voice_conversion///" --source_wav "my/source/speaker/audio.wav" --target_wav "my/target/speaker/audio.wav" --out_path folder/to/save/output.wav ``` **Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder. ## On the Demo Server - `tts-server` ![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif) You can boot up a demo 🐸TTS server to run an inference with your models. Note that the server is not optimized for performance but gives you an easy way to interact with the models. The demo server provides pretty much the same interface as the CLI command. ```bash tts-server -h # see the help tts-server --list_models # list the available models. ``` Run a TTS model, from the release models list, with its default vocoder. If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize speech. ```bash tts-server --model_name "///" ``` Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. ```bash tts-server --model_name "///" \ --vocoder_name "///" ``` ## Python 🐸TTS API You can run a multi-speaker and multi-lingual model in Python as ```python import torch from TTS.api import TTS # Get device device = "cuda" if torch.cuda.is_available() else "cpu" # List available 🐸TTS models print(TTS().list_models()) # Init TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device) # Run TTS # ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language # Text to speech list of amplitude values as output wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en") # Text to speech to a file tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") ``` #### Here is an example for a single speaker model. ```python # Init TTS with the target model name tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False) # Run TTS tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH) ``` #### Example voice cloning with YourTTS in English, French and Portuguese: ```python tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda") tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr", file_path="output.wav") tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav") ``` #### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav` ```python tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda") tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav") ``` #### Example voice cloning by a single speaker TTS model combining with the voice conversion model. This way, you can clone voices by using any model in 🐸TTS. 
```python tts = TTS("tts_models/de/thorsten/tacotron2-DDC") tts.tts_with_vc_to_file( "Wie sage ich auf Italienisch, dass ich dich liebe?", speaker_wav="target/speaker.wav", file_path="ouptut.wav" ) ``` #### Example text to speech using **Fairseq models in ~1100 languages** 🤯. For these models use the following name format: `tts_models//fairseq/vits`. You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). ```python from TTS.api import TTS api = TTS(model_name="tts_models/eng/fairseq/vits").to("cuda") api.tts_to_file("This is a test.", file_path="output.wav") # TTS with on the fly voice conversion api = TTS("tts_models/deu/fairseq/vits") api.tts_with_vc_to_file( "Wie sage ich auf Italienisch, dass ich dich liebe?", speaker_wav="target/speaker.wav", file_path="ouptut.wav" ) ``` --- # Installation 🐸TTS supports python >=3.7 <3.11.0 and tested on Ubuntu 18.10, 19.10, 20.10. ## Using `pip` `pip` is recommended if you want to use 🐸TTS only for inference. You can install from PyPI as follows: ```bash pip install TTS # from PyPI ``` Or install from Github: ```bash pip install git+https://github.com/coqui-ai/TTS # from Github ``` ## Installing From Source This is recommended for development and more control over 🐸TTS. ```bash git clone https://github.com/coqui-ai/TTS/ cd TTS make system-deps # only on Linux systems. make install ``` ## On Windows If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](https://stackoverflow.com/questions/66726331/ --- # AudioProcessor API `TTS.utils.audio.AudioProcessor` is the core class for all the audio processing routines. It provides an API for - Feature extraction. - Sound normalization. - Reading and writing audio files. - Sampling audio signals. - Normalizing and denormalizing audio signals. - Griffin-Lim vocoder. The `AudioProcessor` needs to be initialized with `TTS.config.shared_configs.BaseAudioConfig`. Any model config also must inherit or initiate `BaseAudioConfig`. ## AudioProcessor ```{eval-rst} .. autoclass:: TTS.utils.audio.AudioProcessor :members: ``` ## BaseAudioConfig ```{eval-rst} .. autoclass:: TTS.config.shared_configs.BaseAudioConfig :members: ``` --- # Datasets ## TTS Dataset ```{eval-rst} .. autoclass:: TTS.tts.datasets.TTSDataset :members: ``` ## Vocoder Dataset ```{eval-rst} .. autoclass:: TTS.vocoder.datasets.gan_dataset.GANDataset :members: ``` ```{eval-rst} .. autoclass:: TTS.vocoder.datasets.wavegrad_dataset.WaveGradDataset :members: ``` ```{eval-rst} .. autoclass:: TTS.vocoder.datasets.wavernn_dataset.WaveRNNDataset :members: ``` --- # GAN API The {class}`TTS.vocoder.models.gan.GAN` provides an easy way to implementing new GAN based models. You just need to define the model architectures for the generator and the discriminator networks and give them to the `GAN` class to do its ✨️. ## GAN ```{eval-rst} .. autoclass:: TTS.vocoder.models.gan.GAN :members: ``` --- # Model API Model API provides you a set of functions that easily make your model compatible with the `Trainer`, `Synthesizer` and `ModelZoo`. ## Base TTS Model ```{eval-rst} .. autoclass:: TTS.model.BaseTrainerModel :members: ``` ## Base tts Model ```{eval-rst} .. autoclass:: TTS.tts.models.base_tts.BaseTTS :members: ``` ## Base vocoder Model ```{eval-rst} .. 
autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder
    :members:
```

---

# Speaker Manager API

The {class}`TTS.tts.utils.speakers.SpeakerManager` organizes speaker-related data and information for 🐸TTS models. It is especially useful for multi-speaker models.

## Speaker Manager

```{eval-rst}
.. automodule:: TTS.tts.utils.speakers
    :members:
```

---

# Trainer API

We made the trainer a separate project: https://github.com/coqui-ai/Trainer

---

# Mary-TTS API Support for Coqui-TTS

## What is Mary-TTS?

[Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.

MaryTTS has been around for a very long time. Version 3.0 dates back to 2006, long before deep learning was a broadly known term, and the last official release was version 5.2 in 2016. You can check out this OpenVoice-Tech page to learn more: https://openvoice-tech.net/index.php/MaryTTS

## Why Mary-TTS compatibility is relevant

Due to its open-source nature, relatively high-quality voices and fast synthesis speed, Mary-TTS was a popular choice in the past, and many tools have implemented API support for it over the years, such as screen readers (NVDA + SpeechHub), smart-home hubs (openHAB, Home Assistant) and voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.

## API and code examples

Like Coqui-TTS, Mary-TTS can run as an HTTP server to allow access to the API via HTTP GET and POST calls. The best documentation of this API is probably the [web page](https://github.com/marytts/marytts/tree/master/marytts-runtime/src/main/resources/marytts/server/http), available via your self-hosted Mary-TTS server, and the [Java docs page](http://mary.dfki.de/javadoc/marytts/server/http/MaryHttpServer.html).

Mary-TTS offers a large number of endpoints to load styles, audio effects, examples etc., but compatible tools often only require 3 of them to work:

- `/locales` (GET) - Returns a list of supported locales in the format `[locale]\n...`, for example "en_US" or "de_DE" or simply "en" etc.
- `/voices` (GET) - Returns a list of supported voices in the format `[name] [locale] [gender]\n...`, 'name' can be anything without spaces(!) and 'gender' is traditionally `f` or `m`
- `/process?INPUT_TEXT=[my text]&INPUT_TYPE=TEXT&LOCALE=[locale]&VOICE=[name]&OUTPUT_TYPE=AUDIO&AUDIO=WAVE_FILE` (GET/POST) - Processes the input text and returns a wav file. INPUT_TYPE, OUTPUT_TYPE and AUDIO support additional values, but are usually static in compatible tools.

If your Coqui-TTS server is running on `localhost` using `port` 59125 (for classic Mary-TTS compatibility), you can use the following curl requests to test the API:

Return the locale of the active voice, e.g. "en":
```bash
curl http://localhost:59125/locales
```

Return the name of the active voice, e.g.
"glow-tts en u" ```bash curl http://localhost:59125/voices ``` Create a wav-file with spoken input text: ```bash curl http://localhost:59125/process?INPUT_TEXT=this+is+a+test > test.wav ``` You can enter the same URLs in your browser and check-out the results there as well. ### How it works and limitations A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined). We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time. --- # 🐶 Bark Bark is a multi-lingual TTS model created by [Suno-AI](https://www.suno.ai/). It can generate conversational speech as well as music and sound effects. It is architecturally very similar to Google's [AudioLM](https://arxiv.org/abs/2209.03143). For more information, please refer to the [Suno-AI's repo](https://github.com/suno-ai/bark). ## Acknowledgements - 👑[Suno-AI](https://www.suno.ai/) for training and open-sourcing this model. - 👑[gitmylo](https://github.com/gitmylo) for finding [the solution](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/) to the semantic token generation for voice clones and finetunes. - 👑[serp-ai](https://github.com/serp-ai/bark-with-voice-clone) for controlled voice cloning. ## Example Use ```python text = "Hello, my name is Manmay , how are you?" from TTS.tts.configs.bark_config import BarkConfig from TTS.tts.models.bark import Bark config = BarkConfig() model = Bark.init_from_config(config) model.load_checkpoint(config, checkpoint_dir="path/to/model/dir/", eval=True) # with random speaker output_dict = model.synthesize(text, config, speaker_id="random", voice_dirs=None) # cloning a speaker. # It assumes that you have a speaker file in `bark_voices/speaker_n/speaker.wav` or `bark_voices/speaker_n/speaker.npz` output_dict = model.synthesize(text, config, speaker_id="ljspeech", voice_dirs="bark_voices/") ``` Using 🐸TTS API: ```python from TTS.api import TTS # Load the model to GPU # Bark is really slow on CPU, so we recommend using GPU. tts = TTS("tts_models/multilingual/multi-dataset/bark", gpu=True) # Cloning a new speaker # This expects to find a mp3 or wav file like `bark_voices/new_speaker/speaker.wav` # It computes the cloning values and stores in `bark_voices/new_speaker/speaker.npz` tts.tts_to_file(text="Hello, my name is Manmay , how are you?", file_path="output.wav", voice_dir="bark_voices/", speaker="ljspeech") # When you run it again it uses the stored values to generate the voice. tts.tts_to_file(text="Hello, my name is Manmay , how are you?", file_path="output.wav", voice_dir="bark_voices/", speaker="ljspeech") # random speaker tts = TTS("tts_models/multilingual/multi-dataset/bark", gpu=True) tts.tts_to_file("hello world", file_path="out.wav") ``` Using 🐸TTS Command line: ```console # cloning the `ljspeech` voice tts --model_name tts_models/multilingual/multi-dataset/bark \ --text "This is an example." 
\ --out_path "output.wav" \ --voice_dir bark_voices/ \ --speaker_idx "ljspeech" \ --progress_bar True # Random voice generation tts --model_name tts_models/multilingual/multi-dataset/bark \ --text "This is an example." \ --out_path "output.wav" \ --progress_bar True ``` ## Important resources & papers - Original Repo: https://github.com/suno-ai/bark - Cloning implementation: https://github.com/serp-ai/bark-with-voice-clone - AudioLM: https://arxiv.org/abs/2209.03143 ## BarkConfig ```{eval-rst} .. autoclass:: TTS.tts.configs.bark_config.BarkConfig :members: ``` ## Bark Model ```{eval-rst} .. autoclass:: TTS.tts.models.bark.Bark :members: ``` --- # Forward TTS model(s) A general feed-forward TTS model implementation that can be configured to different architectures by setting different encoder and decoder networks. It can be trained with either pre-computed durations (from pre-trained Tacotron) or an alignment network that learns the text to audio alignment from the input data. Currently we provide the following pre-configured architectures: - **FastSpeech:** It's a feed-forward model TTS model that uses Feed Forward Transformer (FFT) modules as the encoder and decoder. - **FastPitch:** It uses the same FastSpeech architecture that is conditioned on fundamental frequency (f0) contours with the promise of more expressive speech. - **SpeedySpeech:** It uses Residual Convolution layers instead of Transformers that leads to a more compute friendly model. - **FastSpeech2 (TODO):** Similar to FastPitch but it also uses a spectral energy values as an addition. ## Important resources & papers - FastPitch: https://arxiv.org/abs/2006.06873 - SpeedySpeech: https://arxiv.org/abs/2008.03802 - FastSpeech: https://arxiv.org/pdf/1905.09263 - FastSpeech2: https://arxiv.org/abs/2006.04558 - Aligner Network: https://arxiv.org/abs/2108.10447 - What is Pitch: https://www.britannica.com/topic/pitch-speech ## ForwardTTSArgs ```{eval-rst} .. autoclass:: TTS.tts.models.forward_tts.ForwardTTSArgs :members: ``` ## ForwardTTS Model ```{eval-rst} .. autoclass:: TTS.tts.models.forward_tts.ForwardTTS :members: ``` ## FastPitchConfig ```{eval-rst} .. autoclass:: TTS.tts.configs.fast_pitch_config.FastPitchConfig :members: ``` ## SpeedySpeechConfig ```{eval-rst} .. autoclass:: TTS.tts.configs.speedy_speech_config.SpeedySpeechConfig :members: ``` ## FastSpeechConfig ```{eval-rst} .. autoclass:: TTS.tts.configs.fast_speech_config.FastSpeechConfig :members: ``` --- # Glow TTS Glow TTS is a normalizing flow model for text-to-speech. It is built on the generic Glow model that is previously used in computer vision and vocoder models. It uses "monotonic alignment search" (MAS) to fine the text-to-speech alignment and uses the output to train a separate duration predictor network for faster inference run-time. ## Important resources & papers - GlowTTS: https://arxiv.org/abs/2005.11129 - Glow (Generative Flow with invertible 1x1 Convolutions): https://arxiv.org/abs/1807.03039 - Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html ## GlowTTS Config ```{eval-rst} .. autoclass:: TTS.tts.configs.glow_tts_config.GlowTTSConfig :members: ``` ## GlowTTS Model ```{eval-rst} .. autoclass:: TTS.tts.models.glow_tts.GlowTTS :members: ``` --- # Overflow TTS Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. 
They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. The OverFlow paper combines neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Compared to dominant flow-based acoustic models, this approach integrates autoregression for improved modelling of long-range dependences such as utterance-level prosody. Experiments show that a system based on this proposal gives more accurate pronunciations and better subjective speech quality than comparable methods, whilst retaining the original advantages of neural HMMs. Audio examples and code are available at https://shivammehta25.github.io/OverFlow/.

## Important resources & papers

- HMM: https://de.wikipedia.org/wiki/Hidden_Markov_Model
- OverflowTTS paper: https://arxiv.org/abs/2211.06892
- Neural HMM: https://arxiv.org/abs/2108.13320
- Audio Samples: https://shivammehta25.github.io/OverFlow/

## OverflowConfig

```{eval-rst}
.. autoclass:: TTS.tts.configs.overflow_config.OverflowConfig
    :members:
```

## Overflow Model

```{eval-rst}
.. autoclass:: TTS.tts.models.overflow.Overflow
    :members:
```

---

# 🌮 Tacotron 1 and 2

Tacotron is one of the first successful DL-based text-to-mel models and opened up the whole TTS field for more DL research. Tacotron is mainly an encoder-decoder model with attention. The encoder takes input tokens (characters or phonemes) and the decoder outputs mel-spectrogram frames. The attention module in between learns to align the input tokens with the output mel-spectrograms.

Tacotron1 and 2 are both built on the same encoder-decoder architecture but they use different layers. Additionally, Tacotron1 uses a Postnet module to convert mel-spectrograms to linear spectrograms with a higher resolution before the vocoder.

Vanilla Tacotron models are slow at inference due to their auto-regressive nature, which prevents the model from processing all the inputs in parallel. One trick is to use a higher "reduction rate" that helps the model predict multiple frames at once. That is, a reduction rate of 2 cuts the number of decoder iterations in half.

Tacotron also uses a Prenet module with Dropout that projects the model's previous output before feeding it to the decoder again. The paper and most of the implementations use the Dropout layer even at inference and report that the attention fails or the voice quality degrades otherwise. The downside is that you get slightly different output speech every time you run the model.

Training the attention is notoriously problematic in Tacotron models. Especially at inference, the alignment fails for some input sequences and causes the model to produce unexpected results. Many different methods have been proposed to improve the attention. After hundreds of experiments, at 🐸TTS we suggest Double Decoder Consistency, which leads to the most robust model performance. If you have limited VRAM, then you can try using the Guided Attention Loss or the Dynamic Convolutional Attention. You can also combine the two.
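As a rough illustration of the knobs discussed above, a Tacotron2 training config might toggle the reduction rate, Double Decoder Consistency and the attention type as in the sketch below. The field names follow `TTS.tts.configs.tacotron2_config.Tacotron2Config`, but this is not a tuned recipe; double-check the fields against your installed 🐸TTS version.

```python
from TTS.tts.configs.tacotron2_config import Tacotron2Config

# A sketch, not a tuned recipe; verify field names against your version.
config = Tacotron2Config(
    r=2,                              # reduction rate: predict 2 frames per decoder step
    double_decoder_consistency=True,  # enable DDC for more robust attention
    ddc_r=6,                          # reduction rate of the coarse DDC decoder
    attention_type="original",        # alternatives include "graves" and "dynamic_convolution"
    use_forward_attn=False,           # forward attention can also help alignment
)
```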
## Important resources & papers

- Tacotron: https://arxiv.org/abs/1703.10135
- Tacotron2: https://arxiv.org/abs/1712.05884
- Double Decoder Consistency: https://coqui.ai/blog/tts/solving-attention-problems-of-tts-models-with-double-decoder-consistency
- Guided Attention Loss: https://arxiv.org/abs/1710.08969
- Forward & Backward Decoder: https://arxiv.org/abs/1907.09006
- Forward Attention: https://arxiv.org/abs/1807.06736
- Gaussian Attention: https://arxiv.org/abs/1910.10288
- Dynamic Convolutional Attention: https://arxiv.org/pdf/1910.10288.pdf

## BaseTacotron

```{eval-rst}
.. autoclass:: TTS.tts.models.base_tacotron.BaseTacotron
    :members:
```

## Tacotron

```{eval-rst}
.. autoclass:: TTS.tts.models.tacotron.Tacotron
    :members:
```

## Tacotron2

```{eval-rst}
.. autoclass:: TTS.tts.models.tacotron2.Tacotron2
    :members:
```

## TacotronConfig

```{eval-rst}
.. autoclass:: TTS.tts.configs.tacotron_config.TacotronConfig
    :members:
```

## Tacotron2Config

```{eval-rst}
.. autoclass:: TTS.tts.configs.tacotron2_config.Tacotron2Config
    :members:
```

---

# 🐢 Tortoise

Tortoise is a very expressive TTS system with impressive voice cloning capabilities. It is based on a GPT-like autoregressive acoustic model that converts input text to discretized acoustic tokens, a diffusion model that converts these tokens to mel-spectrogram frames, and a UnivNet vocoder that converts the spectrograms to the final audio signal. The important downside is that Tortoise is very slow compared to parallel TTS models like VITS.

Big thanks to 👑[@manmay-nakhashi](https://github.com/manmay-nakhashi) who helped us implement Tortoise in 🐸TTS.

Example use:

```python
from TTS.tts.configs.tortoise_config import TortoiseConfig
from TTS.tts.models.tortoise import Tortoise

text = "This is an example."

config = TortoiseConfig()
model = Tortoise.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="paths/to/models_dir/", eval=True)

# with random speaker (additional inference settings can be passed as keyword arguments)
output_dict = model.synthesize(text, config, speaker_id="random", extra_voice_dirs=None)

# cloning a speaker
output_dict = model.synthesize(text, config, speaker_id="speaker_n", extra_voice_dirs="path/to/speaker_n/")
```

Using 🐸TTS API:

```python
from TTS.api import TTS

tts = TTS("tts_models/en/multi-dataset/tortoise-v2")

# cloning the `lj` voice from `TTS/tts/utils/assets/tortoise/voices/lj`
# with custom inference settings overriding defaults.
tts.tts_to_file(text="Hello, my name is Manmay, how are you?",
                file_path="output.wav",
                voice_dir="path/to/tortoise/voices/dir/",
                speaker="lj",
                num_autoregressive_samples=1,
                diffusion_iterations=10)

# Using presets with the same voice
tts.tts_to_file(text="Hello, my name is Manmay, how are you?",
                file_path="output.wav",
                voice_dir="path/to/tortoise/voices/dir/",
                speaker="lj",
                preset="ultra_fast")

# Random voice generation
tts.tts_to_file(text="Hello, my name is Manmay, how are you?",
                file_path="output.wav")
```

Using 🐸TTS Command line:

```console
# cloning the `lj` voice
tts --model_name tts_models/en/multi-dataset/tortoise-v2 \
    --text "This is an example." \
    --out_path "output.wav" \
    --voice_dir path/to/tortoise/voices/dir/ \
    --speaker_idx "lj" \
    --progress_bar True

# Random voice generation
tts --model_name tts_models/en/multi-dataset/tortoise-v2 \
    --text "This is an example."
    --out_path "output.wav" \
    --progress_bar True
```

## Important resources & papers
- Original Repo: https://github.com/neonbjb/tortoise-tts
- Faster implementation: https://github.com/152334H/tortoise-tts-fast
- UnivNet: https://arxiv.org/abs/2106.07889
- Latent Diffusion: https://arxiv.org/abs/2112.10752
- DALL-E: https://arxiv.org/abs/2102.12092

## TortoiseConfig
```{eval-rst}
.. autoclass:: TTS.tts.configs.tortoise_config.TortoiseConfig
    :members:
```

## TortoiseArgs
```{eval-rst}
.. autoclass:: TTS.tts.models.tortoise.TortoiseArgs
    :members:
```

## Tortoise Model
```{eval-rst}
.. autoclass:: TTS.tts.models.tortoise.Tortoise
    :members:
```

---

# VITS

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model (encoder and vocoder trained together) that takes advantage of SOTA DL techniques like GANs, VAEs, and Normalizing Flows. It does not require external alignment annotations and learns the text-to-audio alignment using MAS, as explained in the paper. The model architecture is a combination of the GlowTTS encoder and the HiFiGAN vocoder. It is a feed-forward model with a 67.12x real-time factor on a GPU.

🐸 YourTTS is a multi-speaker and multi-lingual TTS model that can perform voice conversion and zero-shot speaker adaptation. It can also learn a new language or voice with an audio clip of roughly one minute. This opens the door to training TTS models for low-resource languages. 🐸 YourTTS uses VITS as the backbone architecture, coupled with a speaker encoder model.

## Important resources & papers
- 🐸 YourTTS: https://arxiv.org/abs/2112.02418
- VITS: https://arxiv.org/pdf/2106.06103.pdf
- Neural Spline Flows: https://arxiv.org/abs/1906.04032
- Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
- Generative Adversarial Networks: https://arxiv.org/abs/1406.2661
- HiFiGAN: https://arxiv.org/abs/2010.05646
- Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html

## VitsConfig
```{eval-rst}
.. autoclass:: TTS.tts.configs.vits_config.VitsConfig
    :members:
```

## VitsArgs
```{eval-rst}
.. autoclass:: TTS.tts.models.vits.VitsArgs
    :members:
```

## Vits Model
```{eval-rst}
.. autoclass:: TTS.tts.models.vits.Vits
    :members:
```

---

# ⓍTTS

ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on 🐢Tortoise, ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. There is no need for an excessive amount of training data that spans countless hours. This is the same model that powers [Coqui Studio](https://coqui.ai/) and [Coqui API](https://docs.coqui.ai/docs); however, we apply a few tricks to make it faster and to support streaming inference.

### Features
- Voice cloning.
- Cross-language voice cloning.
- Multi-lingual speech generation.
- 24 kHz sampling rate.
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference))
- Fine-tuning support. (See [Training](#training))

### Updates with v2
- Improved voice cloning.
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
- 2 new languages: Hungarian and Korean.
- Across-the-board quality improvements.

### Code
The current implementation only supports inference and GPT encoder training.
### Languages
As of now, XTTS-v2 supports 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).

Stay tuned as we continue to add support for more languages. If you have any language requests, please feel free to reach out.

### License
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).

### Contact
Come and join our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Twitter](https://twitter.com/coqui_ai). You can also mail us at info@coqui.ai.

### Inference

#### 🐸TTS Command line

You can check all supported languages with the following command:

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --list_language_idx
```

You can check all available Coqui speakers with the following command:

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --list_speaker_idx
```

##### Coqui speakers
You can run inference with one of the available speakers using the following command:

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent." \
    --speaker_idx "Ana Florence" \
    --language_idx en \
    --use_cuda true
```

##### Clone a voice
You can clone a speaker voice using a single or multiple references:

###### Single reference

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Bugün okula gitmek istemiyorum." \
    --speaker_wav /path/to/target/speaker.wav \
    --language_idx tr \
    --use_cuda true
```

###### Multiple references

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Bugün okula gitmek istemiyorum." \
    --speaker_wav /path/to/target/speaker.wav /path/to/target/speaker_2.wav /path/to/target/speaker_3.wav \
    --language_idx tr \
    --use_cuda true
```

or, for all wav files in a directory, you can use:

```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Bugün okula gitmek istemiyorum." \
    --speaker_wav /path/to/target/*.wav \
    --language_idx tr \
    --use_cuda true
```

#### 🐸TTS API

##### Clone a voice
You can clone a speaker voice using a single or multiple references:

###### Single reference

Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio. You can optionally disable sentence splitting for better coherence, at the cost of more VRAM and possibly hitting the model's context length limit.

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav=["/path/to/target/speaker.wav"],
    language="en",
    split_sentences=True,
)
```

###### Multiple references

You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
```python
from TTS.api import TTS

# using the default version set in 🐸TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# using a specific version
# 👀 see the branch names for versions on https://huggingface.co/coqui/XTTS-v2/tree/main
# ❗some versions might be incompatible with the API
tts = TTS("xtts_v2.0.2", gpu=True)

# getting the latest XTTS_v2
tts = TTS("xtts", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"],
    language="en",
)
```

##### Coqui speakers

You can run inference with one of the available speakers using the following code:

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech with one of the built-in speakers using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker="Ana Florence",
    language="en",
    split_sentences=True,
)
```

#### 🐸TTS Model API

To use the model API, you need to download the model files and pass the config and model file paths manually.

#### Manual Inference

If you want to be able to run `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.

```console
pip install deepspeed==0.10.3
```

##### Inference parameters

- `text`: The text to be synthesized.
- `language`: The language of the text to be synthesized.
- `gpt_cond_latent`: The latent vector you get with `get_conditioning_latents()`. (You can cache it for faster inference with the same speaker.)
- `speaker_embedding`: The speaker embedding you get with `get_conditioning_latents()`. (You can cache it for faster inference with the same speaker.)
- `temperature`: The softmax temperature of the autoregressive model. Defaults to 0.65.
- `length_penalty`: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs. Defaults to 1.0.
- `repetition_penalty`: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. Defaults to 2.0.
- `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
- `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
- `speed`: The speed rate of the generated audio. Defaults to 1.0. (Values far from 1.0 can produce artifacts.)
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might lose important context between sentences. Defaults to True.
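For reference, the sketch below shows how these parameters might be passed explicitly to `model.inference()`, using the documented defaults. The model loading mirrors the full example that follows; paths and the sample sentence are placeholders.

```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model as in the full example below (placeholder paths).
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=False)  # set True if deepspeed is installed
model.cuda()

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

# All inference parameters listed above, shown with their documented defaults.
out = model.inference(
    "A short test sentence.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,
    length_penalty=1.0,
    repetition_penalty=2.0,
    top_k=50,
    top_p=0.8,
    speed=1.0,
    enable_text_splitting=True,
)
torchaudio.save("xtts_defaults.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```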
##### Inference

```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

##### Streaming manually

Here the goal is to stream the audio as it is being generated. This is useful for real-time applications. Streaming inference is typically slower than regular inference, but it lets you get the first chunk of audio faster.

```python
import time

import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunk: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```

### Training

#### Easy training
To make `XTTS_v2` GPT encoder training easier for beginners, we built a Gradio demo that implements the whole fine-tuning pipeline. The Gradio demo enables the user to easily do the following steps:

- Preprocess the uploaded audio or audio files using the 🐸TTS Coqui formatter
- Train the XTTS GPT encoder with the processed data
- Run inference with the fine-tuned model

The user can run this Gradio demo locally or remotely using a Colab Notebook.

##### Run demo on Colab
To make `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available, we created a Google Colab Notebook.

The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).

To learn how to use this Colab Notebook, please check the [XTTS fine-tuning video]().

If you are not able to access the video, follow these steps:

1. Open the Colab notebook and start the demo by running the first two cells (ignore pip install errors in the first one).
2. Click on the link "Running on public URL:" in the second cell output.
3. On the first tab (1 - Data processing), select the audio file or files, wait for the upload to finish, then click the button "Step 1 - Create dataset" and wait until the dataset processing is done.
4. As soon as the dataset processing is done, go to the second tab (2 - Fine-tuning XTTS Encoder), press the button "Step 2 - Run the training" and wait until the training is finished. Note that it can take up to 40 minutes.
5. As soon as the training is done, go to the third tab (3 - Inference), click the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking the button "Step 4 - Inference".

##### Run demo locally

To run the demo locally you need to do the following steps:

1. Install 🐸TTS following the instructions available [here](https://tts.readthedocs.io/en/dev/installation.html#installation).
2. Install the Gradio demo requirements with the command `python3 -m pip install -r TTS/demos/xtts_ft_demo/requirements.txt`
3. Run the Gradio demo using the command `python3 TTS/demos/xtts_ft_demo/xtts_demo.py`
4. Follow the steps presented in the [tutorial video](https://www.youtube.com/watch?v=8tpDiiouGxc&feature=youtu.be) to fine-tune and test the fine-tuned model.

If you are not able to access the video, here is what you need to do:

1. On the first tab (1 - Data processing), select the audio file or files and wait for the upload to finish.
2. Click the button "Step 1 - Create dataset" and wait until the dataset processing is done.
3. Go to the second tab (2 - Fine-tuning XTTS Encoder), press the button "Step 2 - Run the training" and wait until the training is finished. It will take some time.
4. Go to the third tab (3 - Inference), click the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
5. Now you can run inference with the model by clicking the button "Step 4 - Inference".

#### Advanced training

A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py

You need to change the fields of the `BaseDatasetConfig` to match your dataset and then update the `GPTArgs` and `GPTTrainerConfig` fields as you need. By default, it will use the same parameters that the XTTS v1.1 model was trained with. To speed up model convergence, it will also download the XTTS v1.1 checkpoint and load it by default.

After training, you can run inference following the code below.
```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Add here the xtts_config path
CONFIG_PATH = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT-October-23-2023_10+36AM-653f2e75/config.json"
# Add here the vocab file that you have used to train the model
TOKENIZER_PATH = "recipes/ljspeech/xtts_v1/run/training/XTTS_v2_original_model_files/vocab.json"
# Add here the checkpoint that you want to do inference with
XTTS_CHECKPOINT = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT/best_model.pth"
# Add here the speaker reference
SPEAKER_REFERENCE = "LjSpeech_reference.wav"

# output wav path
OUTPUT_WAV_PATH = "xtts-ft.wav"

print("Loading model...")
config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

## References and Acknowledgements
- VALL-E: https://arxiv.org/abs/2301.02111
- Tortoise Repo: https://github.com/neonbjb/tortoise-tts
- Faster implementation: https://github.com/152334H/tortoise-tts-fast
- UnivNet: https://arxiv.org/abs/2106.07889
- Latent Diffusion: https://arxiv.org/abs/2112.10752
- DALL-E: https://arxiv.org/abs/2102.12092
- Perceiver: https://arxiv.org/abs/2103.03206

## XttsConfig
```{eval-rst}
.. autoclass:: TTS.tts.configs.xtts_config.XttsConfig
    :members:
```

## XttsArgs
```{eval-rst}
.. autoclass:: TTS.tts.models.xtts.XttsArgs
    :members:
```

## XTTS Model
```{eval-rst}
.. autoclass:: TTS.tts.models.xtts.Xtts
    :members:
```

---

# Training a Model

1. Decide the model you want to use. Each model has a different set of pros and cons that define the run-time efficiency and the voice quality. It is up to you to decide what model serves your needs. Other than referring to the papers, one easy way is to test the 🐸TTS community models and see how fast and good each of them is. Or you can start a discussion on our communication channels.

2. Understand the configuration, its fields and values. For instance, if you want to train a `Tacotron` model, then see the `TacotronConfig` class and make sure you understand it.

3. Check the recipes. Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good starting point for `Nervous Beginners`. A recipe for `GlowTTS` using the `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`.

```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py
```

You need to change the fields of the `BaseDatasetConfig` to match your dataset and then update the `GlowTTSConfig` fields as you need (a minimal sketch of these fields follows step 4 below).

4. Run the training.

```bash
$ CUDA_VISIBLE_DEVICES="0" python train_glowtts.py
```

Notice that we set the GPU for training with the `CUDA_VISIBLE_DEVICES` environment variable. To see the available GPUs on your system, you can use the `nvidia-smi` command in the terminal.
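As a reference for step 3, here is a minimal sketch of how the dataset and model configuration fields might be adapted to a custom corpus. The formatter name, metadata file, and paths below are placeholders, and the exact fields available should be checked against `BaseDatasetConfig` and `GlowTTSConfig`.

```python
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Placeholder values: point the formatter, metadata file, and path at your dataset.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="path/to/your/dataset/",
)

# Then adjust the model and training fields you care about.
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    run_eval=True,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=False,
    output_path="path/to/output/",
    datasets=[dataset_config],
)
```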
If you would like to run multi-GPU training with the DDP back-end:

```bash
$ CUDA_VISIBLE_DEVICES="0, 1, 2" python -m trainer.distribute --script /train_glowtts.py
```

The example above runs multi-GPU training using GPUs `0, 1, 2`.

The beginning of a training log looks like this:

```console
> Experiment folder: /your/output_path/-Juni-23-2021_02+52-78899209
> Using CUDA: True
> Number of GPUs: 1
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:45
| > do_sound_norm:False
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
| > Found 13100 files in /your/dataset/path/ljspeech/LJSpeech-1.1
> Using model: glow_tts
> Model has 28356129 parameters

> EPOCH: 0/1000

> DataLoader initialization
| > Use phonemes: False
| > Number of instances : 12969
| > Max length sequence: 187
| > Min length sequence: 5
| > Avg length sequence: 98.3403500655409
| > Num. instances discarded by max-min (max=500, min=3) seq limits: 0
| > Batch group size: 0.

> TRAINING (2021-06-23 14:52:54)

--> STEP: 0/405 -- GLOBAL_STEP: 0
| > loss: 2.34670
| > log_mle: 1.61872
| > loss_dur: 0.72798
| > align_error: 0.52744
| > current_lr: 2.5e-07
| > grad_norm: 5.036039352416992
| > step_time: 5.8815
| > loader_time: 0.0065
...
```

5. Run the Tensorboard.

```bash
$ tensorboard --logdir=
```

6. Monitor the training progress. On the terminal and Tensorboard, you can monitor the progress of your model. Also, Tensorboard provides certain figures and sample outputs. Note that different models have different metrics, visuals and outputs. You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions that occur during training.

7. Use your best model for inference. Use the `tts` or `tts-server` commands to test your models.

```bash
$ tts --text "Text for TTS" \
      --model_path path/to/checkpoint_x.pth \
      --config_path path/to/config.json \
      --out_path folder/to/save/output.wav
```

8. Return to step 1 and repeat the process to train a `vocoder` model.

In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.

# Multi-speaker Training

Training a multi-speaker model is mostly the same as training a single-speaker model. You need to specify a couple of configuration parameters, initialize a `SpeakerManager` instance and pass it to the model.

The configuration parameters define whether you want to train the model with a speaker-embedding layer or pre-computed d-vectors. To use d-vectors, you first need to compute them with the `SpeakerEncoder`.

The same GlowTTS model above can be trained on a multi-speaker VCTK dataset with the script below.
```{literalinclude} ../../recipes/vctk/glow_tts/train_glow_tts.py
```

---

# TTS Datasets

Some of the known public datasets to which we have successfully applied 🐸TTS:

- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
- [English - TWEB](https://www.kaggle.com/bryanpark/the-world-english-bible-speech-dataset)
- [English - LibriTTS](https://openslr.org/60/)
- [English - VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Multilingual - M-AI-Labs](http://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)
- [Spanish](https://drive.google.com/file/d/1Sm_zyBo67XHkiFhcRSQ4YaHPYM0slO_e/view?usp=sharing) - thx! @carlfm01
- [German - Thorsten OGVD](https://github.com/thorstenMueller/deep-learning-german-tts)
- [Japanese - Kokoro](https://www.kaggle.com/kaiida/kokoro-speech-dataset-v11-small/version/1)
- [Chinese](https://www.data-baker.com/data/index/source/)
- [Ukrainian - LADA](https://github.com/egorsmkv/ukrainian-tts-datasets/tree/main/lada)

Let us know if you use 🐸TTS on a different dataset.

---

# Tutorial For Nervous Beginners

## Installation

User-friendly installation. Recommended only for synthesizing voice.

```bash
$ pip install TTS
```

Developer-friendly installation.

```bash
$ git clone https://github.com/coqui-ai/TTS
$ cd TTS
$ pip install -e .
```

## Training a `tts` Model

A breakdown of a simple script that trains a GlowTTS model on the LJSpeech dataset. See the comments for more details.

### Pure Python Way

0. Download your dataset. In this example, we download and use the LJSpeech dataset. Set the download directory based on your preferences.

```bash
$ python -c 'from TTS.utils.downloaders import download_ljspeech; download_ljspeech("../recipes/ljspeech/");'
```

1. Define `train.py`.

```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py
```

2. Run the script.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```

- Continue a previous run.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --continue_path path/to/previous/run/folder/
```

- Fine-tune a model.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --restore_path path/to/model/checkpoint.pth
```

- Run multi-GPU training.

```bash
CUDA_VISIBLE_DEVICES=0,1,2 python -m trainer.distribute --script train.py
```

### CLI Way

We still support running training from the CLI like in the old days. The same training run can also be started as follows.

1. Define your `config.json`

```json
{
    "run_name": "my_run",
    "model": "glow_tts",
    "batch_size": 32,
    "eval_batch_size": 16,
    "num_loader_workers": 4,
    "num_eval_loader_workers": 4,
    "run_eval": true,
    "test_delay_epochs": -1,
    "epochs": 1000,
    "text_cleaner": "english_cleaners",
    "use_phonemes": false,
    "phoneme_language": "en-us",
    "phoneme_cache_path": "phoneme_cache",
    "print_step": 25,
    "print_eval": true,
    "mixed_precision": false,
    "output_path": "recipes/ljspeech/glow_tts/",
    "datasets": [{"formatter": "ljspeech", "meta_file_train": "metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
}
```

2. Start training.

```bash
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json
```

## Training a `vocoder` Model

```{literalinclude} ../../recipes/ljspeech/hifigan/train_hifigan.py
```

❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.

## Synthesizing Speech

You can run `tts` and synthesize speech directly on the terminal.

```bash
$ tts -h # see the help
$ tts --list_models # list the available models.
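# e.g. synthesize speech with one of the listed models
# (illustrative model name and output path; pick a model from the --list_models output)
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output.wav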
```

![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif)

You can call `tts-server` to start a local demo server that you can open in your favorite web browser and 🗣️.

```bash
$ tts-server -h # see the help
$ tts-server --list_models # list the available models.
```

![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif)

---

(what_makes_a_good_dataset)=
# What makes a good TTS dataset

## What Makes a Good Dataset

* **Gaussian-like distribution of clip and text lengths**. Plot the distribution of clip lengths and check that it covers enough short and long voice clips.
* **Mistake-free**. Remove any wrong or broken files. Check annotations and compare transcript and audio lengths.
* **Noise-free**. Background noise might lead your model to struggle, especially with learning a good alignment. Even if it learns the alignment, the final result is likely to be suboptimal.
* **Compatible tone and pitch among voice clips**. For instance, if you are using audiobook recordings for your project, they might contain impersonations of different characters in the book. These differences between samples degrade the model performance.
* **Good phoneme coverage**. Make sure that your dataset covers a good portion of the phonemes, di-phonemes, and, in some languages, tri-phonemes.
* **Naturalness of recordings**. For your model, WISIAIL (What it sees is all it learns). Therefore, your dataset should accommodate all the attributes you want to hear from your model.

## Preprocessing Dataset

If you would like to use a bespoke dataset, you might want to perform a couple of quality checks before training. 🐸TTS provides a couple of notebooks (CheckSpectrograms, AnalyzeDataset) to expedite this part for you.

* **AnalyzeDataset** is for checking the dataset distribution in terms of clip and transcript lengths. It is good for finding outlier instances (too long, short text but long voice clip, etc.) and removing them before training. Keep in mind that we like to have a good balance between long and short clips to prevent any bias in training. If you have only short clips (1-3 secs), then your model might struggle with long sentences, and if your instances are long, then it might not learn the alignment or might take too long to train.

* **CheckSpectrograms** is for measuring the noise level of the clips and finding good audio processing parameters. The noise level can be observed by checking the spectrograms. If the spectrograms look cluttered, especially in silent parts, this dataset might not be a good candidate for a TTS project. If your voice clips are too noisy in the background, it makes things harder for your model to learn the alignment, and the final result might be different from the voice you are given. If the spectrograms look good, then the next step is to find a good set of audio processing parameters, defined in ```config.json```. In the notebook, you can compare different sets of parameters and see the resynthesis results in relation to the given ground truth. Find the parameters that give the best possible synthesis performance.

Another practical detail is the quantization level of the clips. If your dataset has a very high bit rate, that might cause slow data-load times and consequently slow training. It is better to reduce the sample rate of your dataset to around 16000-22050 Hz.
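If you need to downsample a dataset, a small script along the lines of the sketch below (using `librosa` and `soundfile`, with placeholder paths; this is not a 🐸TTS utility) is usually enough:

```python
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 22050  # target sample rate in Hz

dataset_dir = Path("path/to/your/dataset/wavs")       # placeholder input folder
output_dir = Path("path/to/your/dataset/wavs_22050")  # placeholder output folder
output_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(dataset_dir.glob("*.wav")):
    # librosa resamples on load when `sr` is given
    audio, _ = librosa.load(wav_path, sr=TARGET_SR)
    sf.write(output_dir / wav_path.name, audio, TARGET_SR)
```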