# Amphion > Amphion provides comprehensive objective evaluation capabilities and multiple state-of-the-art neural vocoders for audio synthesis tasks. --- # Evaluation Metrics and Vocoders in Amphion ## Overview Amphion provides comprehensive objective evaluation capabilities and multiple state-of-the-art neural vocoders for audio synthesis tasks. ## Evaluation Metrics Amphion implements a complete set of evaluation metrics for assessing audio generation quality across multiple dimensions: ### F0 Modeling Metrics Evaluate pitch/fundamental frequency accuracy: - **F0 Pearson Coefficient**: Correlation between predicted and ground truth F0 - Range: -1.0 to 1.0 - Higher is better - Measures pitch contour tracking - **F0 Periodicity RMSE**: Root Mean Square Error for voiced/unvoiced detection - Measures periodicity accuracy - Lower is better - **F0 RMSE**: Root Mean Square Error for pitch value prediction - Measures absolute pitch accuracy - Lower is better - **Voiced/Unvoiced F1 Score**: Binary classification accuracy - Measures ability to detect voiced vs unvoiced segments - Range: 0-1, higher is better ```python from amphion.evaluation import F0Metrics f0_metrics = F0Metrics() # Compute metrics pearson_coef = f0_metrics.pearson_coefficient(pred_f0, gt_f0) voicing_f1 = f0_metrics.voicing_f1(pred_voicing, gt_voicing) rmse = f0_metrics.f0_rmse(pred_f0, gt_f0) ``` ### Energy Modeling Metrics Evaluate energy/amplitude accuracy: - **Energy RMSE**: Root Mean Square Error for energy prediction - Lower is better - Measures amplitude accuracy - **Energy Pearson Coefficient**: Correlation with ground truth energy - Range: -1.0 to 1.0 - Higher is better ```python from amphion.evaluation import EnergyMetrics energy_metrics = EnergyMetrics() rmse = energy_metrics.energy_rmse(pred_energy, gt_energy) pearson = energy_metrics.pearson_coefficient(pred_energy, gt_energy) ``` ### Intelligibility Metrics Measure content preservation and clarity: - **Character Error Rate (CER)**: Character-level WER - Requires ASR model (Whisper) - Lower is better (0% = perfect) - **Word Error Rate (WER)**: Word-level error rate - Requires ASR model (Whisper) - Lower is better (0% = perfect) ```python from amphion.evaluation import IntelligibilityMetrics from amphion.models import WhisperExtractor intelligibility = IntelligibilityMetrics( asr_model=WhisperExtractor('base') ) wer = intelligibility.word_error_rate(audio, reference_text) cer = intelligibility.character_error_rate(audio, reference_text) ``` ### Spectrogram Distortion Metrics Measure audio quality and similarity: #### Frechet Audio Distance (FAD) - **Purpose**: Perceptual audio quality metric - **Range**: 0 to infinity (lower is better) - **Based on**: VGGish audio feature embeddings - **Interpretation**: Distance between distributions ```python from amphion.evaluation import FADMetrics fad_metrics = FADMetrics() # Compute FAD fad = fad_metrics.compute(generated_audio, reference_audio) # Typical good value: < 3.0 ``` #### Mel-Cepstral Distortion (MCD) - **Purpose**: Spectral similarity measure - **Range**: 0 to infinity (lower is better) - **Best For**: Voice conversion, TTS - **Unit**: dB ```python from amphion.evaluation import MCDMetrics mcd_metrics = MCDMetrics() # Compute MCD mcd = mcd_metrics.compute(predicted, reference) # Typical good value: < 5.0 dB ``` #### Multi-Resolution STFT Distance (MSTFT) - **Purpose**: Multi-scale spectral comparison - **Uses**: Multiple window sizes and FFT lengths - **Range**: 0 to infinity (lower is better) ```python from 
amphion.evaluation import MSTFTMetrics mstft_metrics = MSTFTMetrics() mag_loss, phase_loss = mstft_metrics.compute(predicted, reference) ``` #### PESQ (Perceptual Evaluation of Speech Quality) - **Purpose**: Subjective speech quality prediction - **Range**: -0.5 to 4.5 (higher is better) - **Best For**: Speech synthesis quality - **Correlation**: High correlation with MOS ```python from amphion.evaluation import PESQMetrics pesq_metrics = PESQMetrics() score = pesq_metrics.compute(reference, generated) # Typical good value: > 3.0 ``` #### STOI (Short Time Objective Intelligibility) - **Purpose**: Speech intelligibility metric - **Range**: 0 to 1 (higher is better) - **Based on**: SNR estimates in bark bands ```python from amphion.evaluation import STOIMetrics stoi_metrics = STOIMetrics() score = stoi_metrics.compute(reference, generated) # Typical good value: > 0.8 ``` ### Speaker Similarity Metrics Measure speaker identity preservation: Supported speaker verification models: - **RawNet3**: End-to-end speaker recognition - **Resemblyzer**: Simple speaker embedding - **WeSpeaker**: WeNet speaker embedding - **WavLM**: Large multilingual model ```python from amphion.evaluation import SpeakerSimilarityMetrics from amphion.models import RawNet3 speaker_metrics = SpeakerSimilarityMetrics( extractor=RawNet3() ) # Cosine similarity similarity = speaker_metrics.cosine_similarity(audio1, audio2) # Range: -1 to 1 (higher = more similar speaker) ``` ## Evaluation Workflow ### Complete Evaluation Script ```python from amphion.evaluation import ( FADMetrics, MCDMetrics, PESQMetrics, SpeakerSimilarityMetrics, IntelligibilityMetrics ) import numpy as np # Initialize metrics fad = FADMetrics() mcd = MCDMetrics() pesq = PESQMetrics() speaker_sim = SpeakerSimilarityMetrics() intelligibility = IntelligibilityMetrics() # Load generated and reference audio generated_audio = load_audio('generated.wav') reference_audio = load_audio('reference.wav') # Compute all metrics results = { 'fad': fad.compute(generated_audio, reference_audio), 'mcd': mcd.compute(generated_audio, reference_audio), 'pesq': pesq.compute(reference_audio, generated_audio), 'speaker_sim': speaker_sim.cosine_similarity( generated_audio, reference_audio ), } # Print results for metric, value in results.items(): print(f"{metric}: {value:.4f}") ``` ### Batch Evaluation ```bash python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/vits.pt \ --test-dir path/to/test/data \ --output-csv results.csv ``` Configuration for batch evaluation: ```yaml evaluation: metrics: - fad - mcd - pesq - speaker_similarity - intelligibility fad: model: vggish pesq: sample_rate: 16000 speaker_similarity: extractor: rawnet3 intelligibility: asr_model: whisper-base ``` ## Neural Vocoders Vocoders convert acoustic features (mel-spectrograms) to waveforms. Amphion supports multiple vocoder architectures. 
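To make the mel-to-waveform interface concrete before the per-model details below, here is a minimal sketch that computes a log-mel spectrogram with librosa and hands it to a vocoder. The `build_vocoder` call mirrors the hypothetical API used in the Vocoder Evaluation example later on this page and is an assumption, not a confirmed Amphion interface.

```python
import librosa
import numpy as np
import torch

# Compute an 80-bin log-mel spectrogram, a typical neural vocoder input.
wav, sr = librosa.load("reference.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = torch.from_numpy(np.log(np.clip(mel, 1e-5, None))).unsqueeze(0)  # shape: (1, 80, frames)

# Hypothetical vocoder call, mirroring the build_vocoder usage shown further below.
# vocoder = build_vocoder("hifigan")
# with torch.no_grad():
#     waveform = vocoder(log_mel)  # shape: (1, num_samples)
```

Whichever vocoder is chosen, the mel parameters (FFT size, hop length, number of bins, sample rate) must match those used when the vocoder checkpoint was trained; mismatched settings are a common cause of metallic or time-stretched output.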
### GAN-Based Vocoders #### HiFi-GAN **Paper**: https://arxiv.org/abs/2010.05646 **Key Features**: - High-quality audio generation - Fast inference - Lightweight architecture - Multi-scale discriminators **Configuration**: ```yaml vocoder: type: hifigan pretrained: true checkpoint: pretrained/vocoders/hifigan.pt ``` **Performance**: - MOS: ~3.8-4.0 - Inference speed: Real-time (>10x) - Model size: ~3.6M parameters #### NSF-HiFiGAN **Enhancement**: NSF (Neural Source Filter) + HiFi-GAN **Key Features**: - Improved pitch accuracy - Better F0 modeling - Faster convergence ```yaml vocoder: type: nsf_hifigan f0_quantizer: linear ``` #### BigVGAN **Paper**: https://arxiv.org/abs/2206.04658 **Key Features**: - Larger capacity model - Superior audio quality - Better generalization - Improved high-frequency content ```yaml vocoder: type: bigvgan pretrained: true ``` #### APNet **Paper**: https://arxiv.org/abs/2305.07952 **Key Features**: - Adaptive parallel architecture - Efficient design - High-quality output ```yaml vocoder: type: apnet ``` #### MelGAN **Lightweight option for fast inference** **Key Features**: - Small model size - Fast inference - Mobile-friendly ```yaml vocoder: type: melgan ``` ### Flow-Based Vocoders #### WaveGlow **Paper**: https://arxiv.org/abs/1811.00002 **Key Features**: - Normalizing flow model - Parallel generation - Invertible transformation **Configuration**: ```yaml vocoder: type: waveglow n_flows: 12 n_group: 8 ``` ### Diffusion-Based Vocoders #### Diffwave **Paper**: https://arxiv.org/abs/2009.09761 **Key Features**: - Diffusion-based generation - High-quality audio - Slower inference **Configuration**: ```yaml vocoder: type: diffwave num_steps: 50 sampler: ddim ``` ### Auto-Regressive Vocoders #### WaveNet **Paper**: https://arxiv.org/abs/1609.03499 **Key Features**: - Dilated convolutions - Causal generation - High quality but slow #### WaveRNN **Paper**: https://arxiv.org/abs/1802.08435 **Key Features**: - Efficient RNN-based - Faster than WaveNet - Still slower than GAN-based ## Vocoder Training ### Training a Custom Vocoder ```bash python bins/train.py \ --config config/vocoder/hifigan/train.yaml \ --exp-name my_vocoder ``` Configuration: ```yaml model: type: HiFiGAN generator: channels: 512 upsample_scales: [8, 8, 2, 2] upsample_kernel_sizes: [16, 16, 4, 4] discriminator: scales: 3 periods: [2, 3, 5, 7, 11] data: dataset: libritts sample_rate: 16000 batch_size: 32 train: learning_rate_g: 0.0002 learning_rate_d: 0.0002 betas: [0.5, 0.9] max_epochs: 100 ``` ### Vocoder Evaluation ```python from amphion.models import build_vocoder from amphion.evaluation import PESQMetrics # Load vocoder vocoder = build_vocoder('hifigan') # Convert mel-spectrogram to audio mel_spec = load_mel_spectrogram('test.pt') audio = vocoder(mel_spec) # Evaluate pesq_metrics = PESQMetrics() score = pesq_metrics.compute(reference_audio, audio) ``` ## Vocoder Selection Guide | Vocoder | Quality | Speed | Size | Best For | |---------|---------|-------|------|----------| | HiFi-GAN | High | Very Fast | Small | General purpose | | NSF-HiFiGAN | High | Very Fast | Small | Pitch-critical tasks | | BigVGAN | Very High | Fast | Medium | High-quality output | | APNet | High | Very Fast | Small | Efficient systems | | MelGAN | Medium | Very Fast | Tiny | Mobile/edge | | WaveGlow | High | Medium | Large | Parallel generation | | Diffwave | Very High | Slow | Medium | Offline generation | | WaveNet | Very High | Slow | Large | Research | ## Advanced Evaluation ### Custom Metrics Implement custom 
metrics: ```python from amphion.evaluation import AudioMetric class CustomMetric(AudioMetric): def __init__(self): super().__init__() def compute(self, predicted, reference): # Your metric implementation return metric_value # Use custom metric custom = CustomMetric() value = custom.compute(generated_audio, reference_audio) ``` ### Listening Tests Integration Amphion can organize audio samples for listening tests: ```python from amphion.evaluation import ListeningTestOrganizer organizer = ListeningTestOrganizer( models=['model1', 'model2', 'model3'], reference_audios=['ref1.wav', 'ref2.wav'], output_dir='listening_test' ) # Generates HTML interface for MOS collection organizer.generate_mos_interface() ``` ## Resources - **Evaluation Code**: https://github.com/open-mmlab/Amphion/tree/main/egs/metrics/ - **Paper**: https://arxiv.org/abs/2312.09911 - **Multi-Scale CQT Discriminator**: https://arxiv.org/abs/2311.14957 - **Community**: https://discord.com/invite/drhW7ajqAG --- # Amphion: Audio, Music, and Speech Generation Toolkit **Source:** https://github.com/open-mmlab/Amphion ## Overview Amphion (/æmˈfaɪən/) is an open-source deep learning toolkit for audio, music, and speech generation research and development. It is designed to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation. The toolkit offers unique visualizations of classic models and architectures, providing invaluable educational resources for understanding neural audio processing. ## Purpose The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. It is designed to support multiple individual generation tasks with a unified framework and pipeline. ## Supported Tasks Amphion provides comprehensive support for the following audio generation tasks: - **TTS (Text-to-Speech)** - Supported - Convert text to natural-sounding speech - Multiple supported architectures with state-of-the-art performance - **SVC (Singing Voice Conversion)** - Supported - Convert singing voice from one speaker/style to another - Multiple acoustic decoder implementations - **VC (Voice Conversion)** - Supported - Zero-shot and few-shot voice conversion - Controllable timbre and style conversion - **AC (Accent Conversion)** - Supported - Convert accents in speech while preserving content - Zero-shot capability for style conversion - **TTA (Text-to-Audio)** - Supported - Generate audio from textual descriptions - Latent diffusion model architecture - **SVS (Singing Voice Synthesis)** - In Development - Convert text directly to singing voice - **TTM (Text-to-Music)** - In Development - Generate music from textual descriptions ## Key Features ### TTS: Text-to-Speech Amphion achieves state-of-the-art performance on TTS systems with multiple supported architectures: - **FastSpeech2**: Non-autoregressive architecture using feed-forward Transformer blocks - **VITS**: End-to-end architecture with conditional VAE and adversarial learning - **VALL-E**: Zero-shot TTS using neural codec language model with discrete codes - **NaturalSpeech2**: Architecture using latent diffusion models for natural-sounding voices - **Jets**: End-to-end model jointly training FastSpeech2 and HiFi-GAN with alignment - **MaskGCT**: Fully non-autoregressive architecture eliminating explicit alignment requirements - **Vevo-TTS**: Zero-shot TTS with controllable timbre and style ### Voice Conversion & Imitation - **Vevo**: Zero-shot voice imitation framework 
with controllable timbre and style - Vevo-Timbre: Style-preserved voice conversion - Vevo-Voice: Style-converted voice conversion - **FACodec**: Decomposes speech into subspaces for content, prosody, and timbre - Achieves zero-shot voice conversion - **Noro**: Noise-robust zero-shot voice conversion system - Handles noisy reference speeches - Dual-branch reference encoding ### Singing Voice Conversion Amphion implements multiple speaker-agnostic feature representations: - **Content Features**: From WeNet, Whisper, and ContentVec pretrained models - **Prosody Features**: F0 and energy extraction - **Acoustic Decoders**: - Diffusion-based: DiffWaveNetSVC, DiffComoSVC (Consistency Model) - Transformer-based: TransformerSVC (encoder-only, non-autoregressive) - VAE/Flow-based: VitsSVC (VITS-like architecture) ### Text-to-Audio Generation - Latent diffusion model architecture - Two-stage training: VAE (AutoencoderKL) and conditional diffusion (AudioLDM) - Similar to AudioLDM, Make-an-Audio, and AUDIT frameworks ### Neural Audio Codecs - **DualCodec**: Low-frame-rate (12.5Hz or 25Hz) codec with SSL features - **FACodec**: Speech decomposition for content, prosody, and timbre ### Vocoders Amphion supports multiple neural vocoder architectures: - **GAN-based**: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet - **Flow-based**: WaveGlow - **Diffusion-based**: Diffwave - **Auto-regressive**: WaveNet, WaveRNN - **Multi-Scale Constant-Q Transform Discriminator**: Enhancement for GAN vocoders (ICASSP 2024) ### Evaluation Metrics Comprehensive objective evaluation capabilities: - **F0 Modeling**: F0 Pearson Coefficients, Periodicity RMSE, Voiced/Unvoiced F1 Score - **Energy Modeling**: Energy RMSE, Energy Pearson Coefficients - **Intelligibility**: Character/Word Error Rate (via Whisper) - **Spectrogram Distortion**: FAD, MCD, Multi-Resolution STFT Distance, PESQ, STOI - **Speaker Similarity**: Cosine similarity (RawNet3, Resemblyzer, WeSpeaker, WavLM) ### Datasets Amphion provides unified data preprocessing for open-source datasets: - AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK - **Emilia Dataset**: Exclusive support for in-the-wild speech data - 101k+ hours of multilingual speech data - Latest Emilia-Large: 200,000+ hours (Emilia + Emilia-YODAS) - **Emilia-Pipe**: Preprocessing pipeline for in-the-wild speech data ### Visualization Tools - **SingVisio**: Interactive visualization tool for diffusion models in singing voice conversion - Educational resource for understanding model internals - Facilitates understandable research ## Latest Releases ### Amphion v0.2 (January 2025) - Comprehensive technical report covering 2024 updates - Emilia-Large dataset (200k+ hours) - Enhanced multilingual support - Multiple new model releases ### Recent Model Releases - **DualCodec** (May 2025): Low-frame-rate neural audio codec - **Vevo1.5** (April 2025): Unified speech and singing voice generation - **Metis** (February 2025): Foundation model for unified speech generation - **MaskGCT** (October 2024): State-of-the-art non-autoregressive TTS - **Vevo** (December 2024): Zero-shot voice imitation framework ## Pre-trained Models Amphion provides pre-trained models available on: - HuggingFace: https://huggingface.co/amphion - ModelScope: https://modelscope.cn/organization/amphion - OpenXLab: https://openxlab.org.cn/usercenter/Amphion All models are released under the MIT License for both research and commercial use. 
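As a quick way to fetch any of these checkpoints locally, the sketch below uses the standard `huggingface_hub` client. The repo id is illustrative, not a guaranteed model name; browse https://huggingface.co/amphion for the current list, and load the downloaded weights through the model-specific recipes described elsewhere in this documentation.

```python
from huggingface_hub import snapshot_download

# Download a pretrained Amphion model repository from the Hugging Face Hub.
# "amphion/MaskGCT" is an example repo id; substitute the model you need.
local_dir = snapshot_download(repo_id="amphion/MaskGCT")
print(f"Checkpoint files downloaded to: {local_dir}")
```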
## Community & Resources - **GitHub**: https://github.com/open-mmlab/Amphion - **Discord**: Join the community at https://discord.com/invite/drhW7ajqAG - **Paper**: https://arxiv.org/abs/2312.09911 - **Website**: https://amphion.dev - **HuggingFace Demos**: Interactive demos available for multiple models ## License Amphion is released under the MIT License, allowing free use for both research and commercial applications. --- # Amphion Installation Guide ## Overview Amphion can be installed through two methods: 1. Setup Installer (Python environment) 2. Docker Image (containerized with GPU support) ## Method 1: Setup Installer ### Prerequisites - Git - Conda (Anaconda or Miniconda) - Python 3.9+ (recommended: 3.9.15) - CUDA toolkit (for GPU support) - cuDNN (for GPU support) ### Installation Steps #### Step 1: Clone the Repository ```bash git clone https://github.com/open-mmlab/Amphion.git cd Amphion ``` #### Step 2: Create Conda Environment ```bash conda create --name amphion python=3.9.15 conda activate amphion ``` #### Step 3: Install Dependencies Amphion provides an installation script that handles all Python package dependencies: ```bash sh env.sh ``` This script will install: - Core dependencies (PyTorch, torchaudio, librosa) - Model dependencies (diffusers, transformers, julius) - Audio processing (soundfile, scipy, matplotlib) - Data processing (numpy, pandas) - ML utilities (lightning, tensorboard, wandb) #### Step 4: Verify Installation To verify your installation is working: ```bash python -c "import amphion; print('Amphion installed successfully')" ``` ### Troubleshooting **CUDA/GPU Issues**: If you encounter CUDA errors, ensure you have: - Compatible NVIDIA drivers installed - CUDA toolkit matching your PyTorch installation - cuDNN properly configured **Memory Issues**: If you encounter out-of-memory errors: - Reduce batch size in configuration files - Use gradient accumulation - Enable gradient checkpointing ## Method 2: Docker Installation ### Prerequisites - Docker - NVIDIA Driver (latest version recommended) - NVIDIA Container Toolkit - CUDA toolkit (compatible with your NVIDIA driver) ### Installation Steps #### Step 1: Install Docker Dependencies If not already installed: ```bash # Install Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo sh get-docker.sh # Install NVIDIA Container Toolkit distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \ sudo tee /etc/apt/sources.list.d/nvidia-docker.list sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit sudo systemctl restart docker ``` #### Step 2: Clone Repository and Pull Docker Image ```bash git clone https://github.com/open-mmlab/Amphion.git cd Amphion docker pull realamphion/amphion ``` #### Step 3: Run Docker Container Run the Docker container with GPU support: ```bash docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion ``` #### Step 4: Mount Datasets To use your own datasets with Docker, mount them as volumes: ```bash docker run --runtime=nvidia --gpus all -it \ -v .:/app \ -v /path/to/datasets:/app/datasets \ realamphion/amphion ``` For detailed Docker volume mounting instructions, see: - [Mount Dataset in Docker Container](../egs/datasets/docker.md) - [Docker Documentation](https://docs.docker.com/engine/reference/commandline/container_run/#volume) ### Available Docker Images The official Docker image includes: - Pre-installed PyTorch with CUDA support - All Amphion dependencies and models - NVIDIA CUDA runtime - Ready-to-use development environment ## System Requirements ### Minimum Requirements - **CPU**: 4+ cores - **RAM**: 8GB (16GB+ recommended) - **GPU**: NVIDIA GPU with 2GB+ VRAM (for inference) - 8GB+ VRAM recommended for training - **Disk Space**: 20GB+ for models and datasets ### Recommended Configuration - **CPU**: 8+ cores - **RAM**: 32GB+ - **GPU**: NVIDIA GPU with 24GB+ VRAM (for training) - RTX 3090, RTX 4090, H100, or A100 recommended - **Storage**: SSD with 100GB+ free space ## Quick Start After Installation ### Python Usage After installation, use Amphion in your Python code: ```python from amphion.utils import load_config from amphion.models import build_model # Load configuration config = load_config('path/to/config.yaml') # Build and use model model = build_model(config) ``` ### Command Line Usage Access Amphion's CLI tools: ```bash # Activate environment conda activate amphion # Run preprocessing python bins/data/preprocess_dataset.py --config config/... # Train a model python bins/train.py --config config/... # Inference python bins/inference.py --config config/... ``` ### Docker Usage Inside Docker container: ```bash cd /app # Run preprocessing python bins/data/preprocess_dataset.py --config config/... # Train a model python bins/train.py --config config/... # Exit container exit ``` ## Configuration Files Amphion uses YAML configuration files for all tasks. Configuration templates are located in: ``` Amphion/ ├── config/ │ ├── tts/ # Text-to-Speech configs │ ├── svc/ # Singing Voice Conversion configs │ ├── vc/ # Voice Conversion configs │ ├── tta/ # Text-to-Audio configs │ └── vocoder/ # Vocoder configs ``` ## Environment Variables Optional environment variables for advanced configuration: ```bash # Set number of CPU threads export OMP_NUM_THREADS=8 # Set CUDA device export CUDA_VISIBLE_DEVICES=0,1 # Enable mixed precision export AMPHION_MIXED_PRECISION=fp16 ``` The `env.sh` script is provided to set up common environment variables: ```bash source env.sh ``` ## Next Steps After successful installation: 1. 
**Choose a Task**: - [Text-to-Speech (TTS)](../egs/tts/README.md) - [Singing Voice Conversion (SVC)](../egs/svc/README.md) - [Voice Conversion (VC)](../models/vc/vevo/README.md) - [Text-to-Audio (TTA)](../egs/tta/README.md) 2. **Download Datasets**: Check available preprocessed datasets in `egs/datasets/README.md` 3. **Run Examples**: Start with provided recipes and examples 4. **Join Community**: Participate in discussions on [Discord](https://discord.com/invite/drhW7ajqAG) ## Getting Help - **GitHub Issues**: https://github.com/open-mmlab/Amphion/issues - **Discord Community**: https://discord.com/invite/drhW7ajqAG - **Documentation**: https://amphion.dev - **Papers & Reports**: https://arxiv.org/search/?query=amphion --- # Amphion Quick Reference Guide ## Repository Structure ``` Amphion/ ├── bins/ # Command-line scripts │ ├── train.py # Training entrypoint │ ├── inference.py # Inference entrypoint │ └── metrics/ # Evaluation scripts ├── config/ # Configuration files (YAML) │ ├── tts/ # Text-to-Speech configs │ ├── svc/ # Singing Voice Conversion configs │ ├── vc/ # Voice Conversion configs │ ├── tta/ # Text-to-Audio configs │ └── vocoder/ # Vocoder configs ├── models/ # Model implementations │ ├── tts/ # TTS models │ ├── vc/ # VC models (Vevo, FACodec, etc.) │ ├── svc/ # SVC models │ ├── codec/ # Neural codecs │ └── vocoders/ # Vocoders ├── modules/ # Neural network modules ├── preprocessors/ # Data preprocessing │ └── Emilia/ # Emilia dataset preprocessing ├── evaluation/ # Evaluation metrics ├── egs/ # Example recipes │ ├── tts/ # TTS recipes │ ├── svc/ # SVC recipes │ ├── vc/ # VC recipes │ ├── tta/ # TTA recipes │ ├── datasets/ # Dataset instructions │ ├── metrics/ # Evaluation guides │ └── visualization/ # SingVisio visualization └── pretrained/ # Pre-trained model checkpoints ``` ## Essential Commands ### Installation ```bash # Clone repository git clone https://github.com/open-mmlab/Amphion.git cd Amphion # Setup Python environment conda create --name amphion python=3.9.15 conda activate amphion # Install dependencies sh env.sh # Docker alternative docker pull realamphion/amphion docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion ``` ### Data Preparation ```bash # Generic dataset preprocessing python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # Emilia dataset preprocessing python bins/data/preprocess_dataset.py \ --config config/preprocessors/Emilia/emilia_pipe.yaml \ --raw-data-dir /path/to/raw/audio ``` ### Training ```bash # Basic training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment # Resume training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment \ --resume # Distributed training (8 GPUs) python -m torch.distributed.launch \ --nproc_per_node=8 \ bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment # Mixed precision training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment \ --mixed_precision fp16 ``` ### Inference ```bash # TTS inference python bins/inference.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/my_model/ckpt.pt \ --text "Your text here" \ --output output.wav # Voice Conversion inference python bins/inference.py \ --config config/vc/vevo/vevo.yaml \ --checkpoint checkpoints/vevo.pt \ --source-audio source.wav \ --reference-audio reference.wav \ --output output.wav # SVC inference python bins/inference.py \ --config 
config/svc/DiffComoSVC/diffcomosvc.yaml \ --checkpoint checkpoints/svc.pt \ --source-audio source.wav \ --target-speaker speaker_id \ --output output.wav ``` ### Evaluation ```bash # Evaluate model python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/my_model/ckpt.pt \ --test-dir test_data/ \ --output metrics.json # Compute FAD score python bins/metrics/compute_fad.py \ --generated-dir generated_audio/ \ --reference-dir reference_audio/ # ASR evaluation (Word Error Rate) python bins/metrics/compute_asr.py \ --audio-dir generated_audio/ \ --reference-text reference_text.txt ``` ## Configuration Quick Reference ### Common Config Structure ```yaml # Model definition model: type: VITS # Model architecture hidden_size: 384 encoder_hidden_size: 384 num_mels: 80 # Data loading data: dataset: libritts data_dir: /path/to/data batch_size: 16 num_workers: 4 pin_memory: true # Training train: max_epochs: 100 learning_rate: 1e-3 optimizer: adam betas: [0.9, 0.999] weight_decay: 0.0 grad_clip: 5.0 grad_accumulation_steps: 1 # Validation valid: interval: 5000 num_samples: 10 # Checkpointing ckpt: keep_last: 3 keep_best_by_state_dict: true # Logging log: log_interval: 10 log_tensorboard: true ``` ### Task-Specific Configs #### TTS Configuration ```yaml model: type: VITS # or FastSpeech2, VALL-E, Jets # Speaker information (for multi-speaker) speaker: num_speakers: 100 embedding_dim: 256 # Vocoder vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt ``` #### SVC Configuration ```yaml # Content feature extractor acoustic_features: content_feature: type: whisper # or weinet, contentvec prosody: extract_f0: true extract_energy: true # Acoustic decoder acoustic_decoder: type: DiffComoSVC # Speaker info speaker: num_speakers: 100 embedding_dim: 256 ``` #### VC Configuration ```yaml model: type: Vevo # or FACodec, Noro # Inference settings inference: mode: timbre # or voice pitch_scale: 1.0 energy_scale: 1.0 ``` ## Quick Start Recipes ### TTS with VITS ```bash # 1. Prepare data python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # 2. Train python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name vits_libritts # 3. Infer python bins/inference.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/vits_libritts/ckpt.pt \ --text "Hello, this is a test." \ --output output.wav ``` ### SVC with DiffComoSVC ```bash # 1. Prepare dataset python bins/data/preprocess_dataset.py \ --config config/svc/prepare_svcc.yaml \ --datasets svcc # 2. Extract features python bins/data/extract_acoustic_features.py \ --config config/svc/extract_whisper_feature.yaml # 3. Train python bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --exp-name diffcomosvc_svcc # 4. 
Infer python bins/inference.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --checkpoint checkpoints/diffcomosvc_svcc/ckpt.pt \ --source-audio source.wav \ --target-speaker 1 \ --output output.wav ``` ### Voice Conversion with Pre-trained Model ```bash # Download and use pre-trained Vevo python -c " from amphion.models import build_model model = build_model(config) model.load_pretrained('amphion/vevo') " # Run inference python bins/inference.py \ --config config/vc/vevo/vevo.yaml \ --checkpoint pretrained/vevo/vevo.pt \ --source-audio source.wav \ --reference-audio reference.wav \ --output output.wav ``` ## Common File Locations | Item | Location | |------|----------| | Training scripts | `bins/train.py` | | Inference scripts | `bins/inference.py` | | Evaluation scripts | `bins/metrics/` | | Data preprocessing | `bins/data/` | | Model weights | `pretrained/` | | TTS configs | `config/tts/` | | SVC configs | `config/svc/` | | VC configs | `config/vc/` | | TTA configs | `config/tta/` | | Vocoder configs | `config/vocoder/` | | Model code | `models/` | | Dataset recipes | `egs/datasets/` | | Example configs | `egs//` | ## Environment Variables ```bash # Set CUDA devices export CUDA_VISIBLE_DEVICES=0,1,2,3 # Set number of CPU threads export OMP_NUM_THREADS=8 # Enable mixed precision export AMPHION_MIXED_PRECISION=fp16 # Set random seed for reproducibility export PYTHONHASHSEED=0 # Enable deterministic behavior export CUBLAS_WORKSPACE_CONFIG=:16:8 ``` ## Dataset Paths ### Pre-configured Datasets ```bash # Place datasets in these locations for automatic detection: ./data/libritts/ # LibriTTS dataset ./data/ljspeech/ # LJSpeech dataset ./data/vctk/ # VCTK dataset ./data/svcc/ # SVCC dataset ./data/opensinger/ # OpenSinger dataset ./data/emilia/ # Emilia dataset ``` ### Custom Dataset Format ``` custom_dataset/ ├── train/ │ ├── speaker_001/ │ │ ├── audio_001.wav │ │ ├── audio_002.wav │ │ └── transcription.txt │ └── speaker_002/ └── val/ └── speaker_001/ ``` ## Pre-trained Model Hub ### HuggingFace Models ```bash # Access pre-trained models from transformers import AutoModel # Text-to-Speech models amphion/maskgct # State-of-the-art TTS amphion/vall-e # Zero-shot TTS amphion/vits-libritts # VITS trained on LibriTTS # Voice Conversion models amphion/vevo # Zero-shot VC amphion/naturalspeech3_facodec # FACodec amphion/metis # Foundation model for speech # Visit: https://huggingface.co/amphion ``` ### Local Pre-trained Models ```bash # Download pre-trained weights cd pretrained/ # Models are automatically fetched from HuggingFace # Or manually download: wget https://huggingface.co/amphion/vevo/resolve/main/vevo.pt mv vevo.pt vevo/ ``` ## Troubleshooting Commands ```bash # Check installation python -c "import amphion; print(amphion.__version__)" # Verify CUDA python -c "import torch; print(torch.cuda.is_available())" # Check GPU memory python -c "import torch; print(torch.cuda.mem_get_info())" # List available models ls pretrained/ # View training logs tail -f outputs//logs/train.log # Tensorboard visualization tensorboard --logdir outputs//tensorboard ``` ## Performance Tips ### Memory Optimization ```yaml # In configuration: train: gradient_accumulation_steps: 2 # Simulate larger batch enable_gradient_checkpointing: true use_cuda_amp: true # Mixed precision model: use_checkpoint: true # Gradient checkpointing ``` ### Speed Optimization ```bash # Use faster sampler for diffusion models inference: sampler: ddim # Faster than DDPM num_steps: 30 # Fewer steps use_fp16: true # Mixed precision # 
Distributed data loading data: num_workers: 8 # CPU workers for data loading prefetch_factor: 2 ``` ### Quality Optimization ```yaml # Higher quality settings train: max_epochs: 200 # Longer training learning_rate: 5e-4 # Lower learning rate weight_decay: 1e-4 # L2 regularization grad_clip: 1.0 # Tighter gradient clipping model: hidden_size: 512 # Larger model num_layers: 12 # More layers ``` ## Resources - **Official Documentation**: https://amphion.dev - **GitHub Repository**: https://github.com/open-mmlab/Amphion - **Paper (v0.2)**: https://arxiv.org/abs/2501.15442 - **Paper (v0.1)**: https://arxiv.org/abs/2312.09911 - **HuggingFace Models**: https://huggingface.co/amphion - **ModelScope**: https://modelscope.cn/organization/amphion - **Discord Community**: https://discord.com/invite/drhW7ajqAG --- # Singing Voice Conversion (SVC) in Amphion ## Overview Amphion's Singing Voice Conversion (SVC) module enables the conversion of singing voice from one speaker or musical style to another. It supports multiple state-of-the-art architectures and has been the subject of peer-reviewed research published at IEEE SLT 2024. ## Architecture Overview The SVC pipeline typically consists of three main components: 1. **Speaker-Agnostic Feature Extraction**: Extract content representations from source audio 2. **Speaker Embedding Injection**: Inject target speaker information 3. **Waveform Reconstruction**: Generate the output waveform using a vocoder ``` Source Audio → Content Features + Prosody → Acoustic Decoder → Vocoder → Target Audio ↓ Speaker Embedding ``` ## Content Feature Extraction SVC uses speaker-agnostic representations from multiple pretrained models: ### Content Features Extract linguistic content from audio using: - **WeNet**: Automatic Speech Recognition (ASR) based features - Chinese and English support - Robust content representation - https://github.com/wenet-e2e/wenet - **Whisper**: OpenAI's multilingual ASR model - Multi-language support - Robust to noise - Easy integration - https://github.com/openai/whisper - **ContentVec**: Self-supervised content representation - Language-universal features - Pre-trained on multilingual data - https://github.com/auspicious3000/contentvec ### Prosody Features Extract prosodic characteristics: - **F0 (Fundamental Frequency)**: Pitch estimation - **Energy**: Speech intensity and power Configuration example: ```yaml content_feature: type: whisper # or weinet, contentvec use_frame_alignment: true prosody: extract_f0: true extract_energy: true ``` ## Speaker Embeddings Represent target speaker characteristics: ### Speaker Look-Up Table - Pre-computed embeddings for each speaker - Fast inference - Requires speaker ID at test time ### Reference Encoder (Developing) - Extract speaker information from reference audio - Enable zero-shot SVC - No need for pre-computed embeddings ```python # Using speaker embeddings speaker_embedding = model.extract_speaker_embedding(reference_audio) output = model.inference(source_audio, speaker_embedding) ``` ## Acoustic Decoders ### Diffusion-Based Models #### DiffWaveNetSVC **Architecture**: Bidirectional Non-Causal Dilated CNN **Key Features**: - Diffusion probabilistic model framework - Similar to WaveNet and DiffWave - Multiple sampling algorithms support - Deterministic inference possible **Sampling Algorithms**: - **DDPM** (Denoising Diffusion Probabilistic Models): Standard diffusion sampling - **DDIM** (Denoising Diffusion Implicit Models): Faster inference - **PNDM** (Pseudo Numerical Methods): Improved quality 
- **Consistency Model**: Single-step fast inference **Configuration**: ```yaml acoustic_decoder: type: DiffWaveNetSVC num_layers: 20 num_channels: 128 inference: sampler: ddim # or ddpm, pndm, consistency steps: 50 ``` #### DiffComoSVC **Architecture**: Consistency Model based Diffusion **Key Features**: - Significantly faster inference than standard diffusion - Single-step or multi-step sampling - Maintains quality while reducing latency - Based on consistency models **Best For**: Real-time SVC applications ### Transformer-Based Models #### TransformerSVC **Architecture**: Encoder-only Non-autoregressive Transformer **Key Features**: - Pure attention mechanism - Parallel decoding for fast inference - Maintains long-range dependencies - Simple and efficient **Configuration**: ```yaml acoustic_decoder: type: TransformerSVC hidden_size: 384 num_layers: 6 num_heads: 4 feedforward_size: 1536 ``` ### VAE and Flow-Based Models #### VitsSVC **Architecture**: VITS-like Model with Content Features **Key Features**: - Variational autoencoder based - Conditional generation - Normalizing flow for flexible posterior - Similar to so-vits-svc **Paper**: https://arxiv.org/abs/2106.06103 ## Waveform Synthesis (Vocoders) After acoustic decoding, use a vocoder to generate the final waveform: Available vocoders: - **HiFi-GAN**: High-quality GAN-based - **NSF-HiFiGAN**: Neural source-filter (NSF) enhanced HiFi-GAN for improved pitch modeling - **BigVGAN**: Large capacity GAN - **MelGAN**: Lightweight GAN - **WaveGlow**: Flow-based vocoder - **Diffwave**: Diffusion-based vocoder ```yaml vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt ``` ## SVC Workflow ### 1. Data Preparation ```bash # Prepare SVC dataset python bins/data/preprocess_dataset.py \ --config config/svc/prepare_svcc.yaml \ --datasets svcc # For custom datasets: # - Structure: speaker_id/song_name/audio.wav # - Prepare annotations with content and prosody info ``` ### 2. Feature Extraction ```bash # Extract content features python bins/data/extract_acoustic_features.py \ --config config/svc/extract_whisper_feature.yaml \ --data-dir /path/to/svc/data # Extract prosody features python bins/data/extract_prosody.py \ --config config/svc/extract_prosody.yaml ``` ### 3. Training ```bash # Train SVC model python bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --exp-name my_svc_model # Distributed training python -m torch.distributed.launch \ --nproc_per_node=8 \ bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml ``` ### 4. 
Inference ```python from amphion.models import build_model from amphion.utils import load_config import torch import soundfile as sf # Load model config = load_config('config/svc/DiffComoSVC/diffcomosvc.yaml') model = build_model(config) checkpoint = torch.load('checkpoints/my_svc_model.pt') model.load_state_dict(checkpoint['model']) model.eval() # Load audio import librosa source_audio, sr = librosa.load('source_song.wav') # Convert voice with torch.no_grad(): output = model.inference( source_audio, target_speaker_id=1, use_fastest=False # For DiffComoSVC ) # Save output sf.write('output.wav', output.cpu().numpy(), sr) ``` ## Supported Datasets Amphion provides recipes for: - **SVCC** (Singing Voice Conversion Challenge) - **VCTK** (multi-speaker English speech corpus) - **M4Singer** (Chinese multi-singer dataset) - **Opencpop** (Chinese singing dataset) - **OpenSinger** (Multi-speaker singing) - **Emilia**: Large-scale multilingual in-the-wild speech data ## Model Architecture Comparison | Model | Type | Speed | Quality | Zero-Shot | Custom Reference | |-------|------|-------|---------|-----------|------------------| | DiffWaveNetSVC | Diffusion | Medium | High | No | No | | DiffComoSVC | Consistency | Fast | High | No | No | | TransformerSVC | Transformer | Fast | Medium | No | No | | VitsSVC | VAE/Flow | Fast | High | No | No | ## Configuration Structure ```yaml # Content and prosody features acoustic_features: content_feature: type: whisper # whisper, wenet, contentvec prosody: extract_f0: true extract_energy: true # Acoustic decoder acoustic_decoder: type: DiffComoSVC # Model selection hidden_size: 256 num_layers: 20 # Speaker embedding speaker_embedding: type: lookup # or reference_encoder num_speakers: 100 embedding_dim: 256 # Vocoder vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt # Training train: batch_size: 16 num_epochs: 100 learning_rate: 1e-3 optimizer: adamw # Inference inference: sampler: ddim # For diffusion models sampler_steps: 50 ``` ## Advanced Features ### Multiple Content Features Amphion investigates multiple content representations in the official paper: ```yaml # Use multiple content features acoustic_features: content_features: - whisper - contentvec - wenet fusion_method: concatenate # or weighted ``` ### Zero-Shot SVC (Reference Encoder) Extract speaker info from reference audio at inference time: ```python reference_audio, sr = librosa.load('reference_song.wav') speaker_embedding = model.extract_speaker_embedding(reference_audio) output = model.inference( source_audio, speaker_embedding=speaker_embedding ) ``` ## Evaluation Metrics Evaluate SVC models using: - **MCD (Mel-Cepstral Distortion)**: Spectral similarity - **FAD (Frechet Audio Distance)**: Audio distribution distance - **PESQ**: Speech quality assessment - **Similarity Score**: Via speaker verification models (RawNet3, WeSpeaker) - **Intelligibility**: Via ASR (Whisper) ## Research Insights From the Amphion SLT 2024 paper on multiple content features: - **WeNet**: Best for content representation in singing - **Whisper**: Good multilingual support - **ContentVec**: Competitive performance Combining multiple features can improve overall quality. ## Troubleshooting ### Poor Output Quality 1. Check content feature extraction: - Verify alignment between source and extracted features - Try different content models 2. Verify speaker embeddings: - Ensure adequate speaker data - Check speaker embedding dimensions 3. 
Adjust vocoder: - Use higher-quality vocoder - Fine-tune vocoder on target domain ### Artifacts and Noise 1. Increase training duration 2. Use gradient accumulation for larger effective batch size 3. Try different sampler (DDIM → PNDM) 4. Increase sampler steps ### Slow Inference - Use DiffComoSVC for fast diffusion - Use fewer sampler steps - Reduce audio length - Use GPU acceleration ## Resources - **GitHub Recipe**: https://github.com/open-mmlab/Amphion/egs/svc/ - **Paper**: https://arxiv.org/abs/2310.11160 - **Demo**: https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html - **Community**: https://discord.com/invite/drhW7ajqAG --- # Text-to-Audio (TTA) in Amphion ## Overview Amphion's Text-to-Audio (TTA) module enables generation of diverse audio content from natural language descriptions. It uses a latent diffusion model architecture similar to AudioLDM, Make-an-Audio, and AUDIT. ## Architecture Overview The TTA system uses a two-stage approach: ### Stage 1: Latent Space Learning Train an autoencoder to compress audio into a latent space: ``` Raw Audio → Encoder → Latent Codes → Decoder → Reconstructed Audio (Compressed) ``` **Component**: `AutoencoderKL` in Amphion - Variational autoencoder with KL divergence - Compresses audio by ~4x - Learns meaningful latent representations ### Stage 2: Conditional Diffusion in Latent Space Train a diffusion model to generate latent codes conditioned on text: ``` Text Description → Text Encoder → CLIP/CLAP embeddings ↓ Diffusion Model ↓ Latent Codes → Decoder → Generated Audio ``` **Component**: `AudioLDM` in Amphion - Conditional latent diffusion model - Text-conditioned generation - Multiple sampling strategies ## TTA Capabilities ### Diverse Audio Generation Generate different types of audio from descriptions: - **Sound Effects**: Thunder, water splash, door knock - **Music**: Ambient, electronic, acoustic styles - **Environmental Audio**: Forest, traffic, rain sounds - **Speech**: Various prosody and emotion - **Hybrid Content**: Mixed audio scenarios ### Control and Conditioning Fine-grained control over generation: - **Text Prompts**: Descriptive text for generation - **Negative Prompts**: Specify unwanted characteristics - **Duration Control**: Control output length - **Style Control**: Specify audio style/genre - **Intensity Control**: Adjust generation strength ## TTA Workflow ### 1. Model Architecture Setup Configure the two-stage model: ```yaml # Stage 1: VAE (AutoencoderKL) autoencoder_kl: type: AutoencoderKL in_channels: 1 out_channels: 1 latent_channels: 8 hidden_channels: 128 # Stage 2: Diffusion Model diffusion_model: type: AudioLDM latent_channels: 8 text_encoder: t5 # or clap, clip num_steps: 1000 # Diffusion steps ``` ### 2. Data Preparation Prepare audio-text pairs: ```bash # Directory structure dataset/ ├── audio/ │ ├── sound_001.wav │ ├── sound_002.wav │ └── ... └── text_descriptions/ ├── sound_001.txt ├── sound_002.txt └── ... # Each text file contains description of corresponding audio ``` Preprocess audio: ```bash python bins/data/preprocess_tta.py \ --audio-dir path/to/audio \ --text-dir path/to/descriptions \ --output-dir processed_data ``` ### 3. Stage 1: Train AutoencoderKL First, train the VAE to learn latent representation: ```bash python bins/train.py \ --config config/tta/autoencoderkl.yaml \ --exp-name tta_vae ``` Configuration: ```yaml model: type: AutoencoderKL # ... 
architecture parameters data: dataset: audio_descriptions batch_size: 32 num_workers: 4 train: max_epochs: 50 learning_rate: 1e-3 loss_type: mse # Reconstruction loss ``` ### 4. Stage 2: Train AudioLDM After VAE training, train the diffusion model: ```bash python bins/train.py \ --config config/tta/audioldm.yaml \ --exp-name tta_diffusion ``` Configuration: ```yaml model: type: AudioLDM # ... architecture parameters pretrained_vae: path/to/vae_checkpoint.pt # From stage 1 # Text encoder for conditioning text_encoder: type: t5 # or clap, clip model_name: t5-base freeze_encoder: false data: batch_size: 16 train: max_epochs: 100 learning_rate: 5e-5 # Diffusion training specifics ``` ### 5. Inference Generate audio from text descriptions: ```python from amphion.models import build_model from amphion.utils import load_config import torch import torchaudio # Load models config = load_config('config/tta/audioldm.yaml') vae = build_model(config.vae_config) diffusion_model = build_model(config.diffusion_config) # Load checkpoints vae.load_state_dict(torch.load('checkpoints/vae.pt')) diffusion_model.load_state_dict(torch.load('checkpoints/diffusion.pt')) vae.eval() diffusion_model.eval() # Generate audio text_prompt = "A dog barking in the distance with ambient traffic noise" with torch.no_grad(): # Text encoding text_embeddings = diffusion_model.encode_text(text_prompt) # Diffusion sampling in latent space latent_codes = diffusion_model.sample( embeddings=text_embeddings, num_steps=50, guidance_scale=7.5 ) # Decode to audio audio = vae.decode(latent_codes) # Save generated audio torchaudio.save('output.wav', audio.squeeze(0), 16000) ``` ### 6. Advanced Inference Options #### Negative Prompts Specify what NOT to generate: ```python output = diffusion_model.sample( prompt="Dog barking", negative_prompt="cat, bird, quiet", guidance_scale=7.5 ) ``` #### Classifier-Free Guidance Control generation strength: ```python output = diffusion_model.sample( prompt="Thunder storm with heavy rain", guidance_scale=10.0 # Higher = stronger adherence to prompt ) ``` #### Sampling Methods Different diffusion samplers: ```python # DDPM (standard) output = diffusion_model.sample(sampler='ddpm', num_steps=1000) # DDIM (faster) output = diffusion_model.sample(sampler='ddim', num_steps=50) # PNDM (quality + speed balance) output = diffusion_model.sample(sampler='pndm', num_steps=50) # Euler output = diffusion_model.sample(sampler='euler', num_steps=30) ``` #### Seed Control Reproducible generation: ```python torch.manual_seed(42) output1 = diffusion_model.sample(prompt="dog barking") torch.manual_seed(42) output2 = diffusion_model.sample(prompt="dog barking") # output1 and output2 are identical ``` ## Text Encoders TTA can use different text encoders: ### T5 (Text-to-Text Transfer Transformer) ```python text_encoder = T5Tokenizer.from_pretrained('t5-base') embeddings = text_encoder('A dog barking') # Shape: [1, seq_length, 768] ``` ### CLAP (Contrastive Language-Audio Pre-training) ```python # CLAP embeddings are audio-aligned text_encoder = CLAPTextEncoder() embeddings = text_encoder('A dog barking') # Shape: [1, 512] - audio-aligned representations ``` ### CLIP (Vision-Language Model) Alternative multi-modal conditioning ## Supported Datasets Amphion supports TTA training on: - **AudioCaps**: 49k audio clips with captions - **Clotho**: 5k audio samples with multiple descriptions - **Emilia**: Large-scale speech descriptions - **Custom Datasets**: With proper annotation format Dataset structure: ```yaml dataset: name: 
audiocaps root_dir: /path/to/audiocaps split: train # or val, test # Preprocessing preprocessing: sample_rate: 16000 num_mels: 64 n_fft: 400 hop_length: 160 ``` ## Configuration Structure Complete TTA configuration: ```yaml # Stage 1: VAE Configuration vae_config: model: type: AutoencoderKL in_channels: 1 latent_channels: 8 hidden_channels: 128 num_res_blocks: 2 train: learning_rate: 1e-3 batch_size: 32 max_epochs: 50 # Stage 2: Diffusion Configuration diffusion_config: model: type: AudioLDM latent_channels: 8 hidden_channels: 512 num_layers: 24 attention_heads: 8 text_encoder: type: t5 # or clap freeze: false diffusion: beta_schedule: linear num_steps: 1000 train: learning_rate: 5e-5 batch_size: 16 max_epochs: 100 warmup_steps: 5000 # Data configuration data: dataset: audiocaps sample_rate: 16000 num_mels: 64 # Inference configuration inference: sampler: ddim num_steps: 50 guidance_scale: 7.5 ``` ## Performance Metrics Evaluate TTA quality using: - **FAD (Frechet Audio Distance)**: Audio distribution similarity - **KL Divergence**: Distribution divergence metric - **PESQ**: Perceived speech quality (for speech-like audio) - **Inception Score**: Diversity and quality metric - **Text Alignment Score**: How well generated audio matches text ## Troubleshooting ### Poor Audio Quality 1. **Increase training**: More epochs, larger dataset 2. **Improve text descriptions**: More detailed, specific prompts 3. **Adjust guidance scale**: Higher values (7.5-15.0) 4. **Try different sampler**: PNDM often better than DDIM ### Mode Collapse (Repetitive Outputs) 1. Increase diversity regularization 2. Use higher temperature in sampling 3. Augment training data with more diverse examples ### Slow Inference 1. Use fewer diffusion steps (DDIM with 30-50 steps) 2. Use GPU acceleration 3. Reduce audio quality (lower sample rate) ### Training Instability 1. Lower learning rate 2. Smaller batch size 3. Gradient clipping 4. Warm-up scheduler ## Advanced Topics ### Fine-tuning Pre-trained Models ```bash python bins/train.py \ --config config/tta/audioldm_finetune.yaml \ --pretrained-diffusion pretrained/audioldm.pt \ --custom-data path/to/custom/data ``` ### Conditioning on Audio Features Additional conditioning beyond text: ```python # Condition on audio duration output = diffusion_model.sample( prompt="dog barking", duration=3.0 # 3 seconds ) # Condition on loudness output = diffusion_model.sample( prompt="dog barking", loudness_db=-10 # Target loudness ) ``` ### Audio Style Transfer Transfer audio style while maintaining content: ```python # Reference audio for style style_audio, sr = librosa.load('reference.wav') style_embeddings = diffusion_model.extract_style(style_audio) # Generate with style output = diffusion_model.sample( prompt="dog barking", style_embeddings=style_embeddings ) ``` ## Research Background TTA in Amphion is based on: - **AudioLDM**: Latent diffusion for audio generation (2301.12503) - **Make-an-Audio**: Large-scale audio generation (2301.12661) - **AUDIT**: Audio understanding through diffusion (2304.00830) These models showed that latent diffusion is highly effective for audio synthesis. 
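Because `guidance_scale` appears throughout the sampling examples in this section, the toy sketch below shows the classifier-free guidance step it controls. The function name and tensor shapes are illustrative only and not part of the Amphion API.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions.

    A guidance_scale of 1.0 reduces to plain conditional sampling; larger
    values (e.g. 7.5) push each denoising step harder toward the text prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage: random tensors stand in for the diffusion model's two noise
# predictions over a batch of latent codes (batch, latent_channels, time, freq).
eps_c = torch.randn(1, 8, 256, 16)
eps_u = torch.randn(1, 8, 256, 16)
eps = classifier_free_guidance(eps_c, eps_u, guidance_scale=7.5)
```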
## Resources - **GitHub Recipe**: https://github.com/open-mmlab/Amphion/egs/tta/ - **Beginner Recipe**: https://github.com/open-mmlab/Amphion/egs/tta/RECIPE.md - **Amphion Paper**: https://arxiv.org/abs/2312.09911 - **AudioLDM Paper**: https://arxiv.org/abs/2301.12503 - **Community**: https://discord.com/invite/drhW7ajqAG --- # Text-to-Speech (TTS) in Amphion ## Overview Amphion's Text-to-Speech (TTS) module provides state-of-the-art text-to-speech capabilities with multiple supported architectures. The TTS system converts natural language text into high-quality synthesized speech with controllable prosody and speaker characteristics. ## Supported TTS Models ### 1. FastSpeech2 **Architecture**: Non-autoregressive Transformer-based **Key Features**: - Feed-forward Transformer blocks - Faster inference than autoregressive models - Supports multiple speakers - Duration prediction for prosody control - Pitch and energy prediction **Best For**: Real-time TTS applications, multi-speaker synthesis **Configuration Location**: `config/tts/FastSpeech2/` ### 2. VITS (Variational Inference with adversarial Learning for end-to-end Text-to-Speech) **Architecture**: End-to-end with Conditional VAE and Adversarial Learning **Key Features**: - Conditional variational autoencoder - Adversarial training with discriminator - Integrated vocoder for waveform generation - Excellent voice quality - Supports multiple speakers **Best For**: High-quality speech synthesis, end-to-end training **Paper**: https://arxiv.org/abs/2106.06103 **Configuration Location**: `config/tts/VITS/` ### 3. VALL-E (Voice Across Languages Language Encoding) **Architecture**: Neural Codec Language Model with Discrete Codes **Key Features**: - Zero-shot TTS capabilities - Uses discrete audio tokens - Few-shot voice adaptation - Multilingual support - Large-scale pre-training **Best For**: Zero-shot voice cloning, multilingual synthesis **Paper**: https://arxiv.org/abs/2301.02111 **Configuration Location**: `config/tts/VALLE/` ### 4. NaturalSpeech2 **Architecture**: Latent Diffusion Model **Key Features**: - Diffusion-based generation - Natural prosody modeling - Improved speech quality - Controllable generation - Superior naturalness **Best For**: Natural-sounding speech, research and development **Paper**: https://arxiv.org/abs/2304.09116 **Configuration Location**: `config/tts/NaturalSpeech2/` ### 5. Jets (Joint End-to-end Text-to-Speech) **Architecture**: Joint Training of FastSpeech2 and HiFi-GAN **Key Features**: - Joint optimization of acoustic model and vocoder - Alignment module for duration prediction - End-to-end training - Improved consistency between stages **Best For**: Unified acoustic and vocoder training **Configuration Location**: `config/tts/Jets/` ### 6. MaskGCT (Masked Generative Codec Transformer) **Architecture**: Fully Non-autoregressive Architecture **Key Features**: - Eliminates explicit text-speech alignment requirements - Fully non-autoregressive generation - State-of-the-art performance - Zero-shot capabilities - Fast inference **Best For**: Fast, alignment-free TTS, zero-shot synthesis **Paper**: https://arxiv.org/abs/2409.00750 **Availability**: Pre-trained models on HuggingFace and ModelScope ### 7. 
Vevo-TTS **Architecture**: Autoregressive + Flow-Matching Transformer **Key Features**: - Zero-shot TTS with controllable timbre and style - Flexible voice control - Speech and singing voice synthesis - Multiple voice aspects controllable - Style transfer capabilities **Best For**: Controllable zero-shot TTS, voice cloning with style control **Paper**: https://openreview.net/pdf?id=anQDiQZhDP **Configuration Location**: `models/vc/vevo/` ## Common TTS Workflow ### 1. Data Preparation ```bash # Prepare your dataset cd Amphion python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # For custom datasets, modify the configuration file to point to your data ``` ### 2. Training ```bash # Train a TTS model python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_tts_model # Resume from checkpoint python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_tts_model \ --resume ``` ### 3. Inference ```python from amphion.models import build_model from amphion.utils import load_config import torch # Load model config = load_config('config/tts/VITS/vits.yaml') model = build_model(config) checkpoint = torch.load('path/to/checkpoint.pt') model.load_state_dict(checkpoint['model']) model.eval() # Generate speech with torch.no_grad(): text = "Hello, this is a test." output = model.inference(text) ``` ### 4. Evaluation ```bash # Evaluate TTS model python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint path/to/checkpoint.pt ``` ## Configuration Structure TTS configurations follow this general structure: ```yaml # Model architecture model: type: VITS # or FastSpeech2, VALL-E, etc. hidden_size: 384 encoder_hidden_size: 384 # ... model-specific parameters # Data configuration data: dataset: libritts data_dir: /path/to/data batch_size: 16 num_workers: 4 # Training configuration train: max_epochs: 100 learning_rate: 1e-3 optimizer: adam grad_clip: 5.0 # Inference configuration inference: speaker_id: 0 # For multi-speaker models duration_scale: 1.0 pitch_scale: 1.0 ``` ## Multi-Speaker TTS For models supporting multiple speakers: ```python # Specify speaker ID during inference output = model.inference( text="Hello world", speaker_id=1 ) # Or use speaker embedding speaker_embedding = model.get_speaker_embedding(speaker_id=1) output = model.inference(text="Hello world", speaker_embedding=speaker_embedding) ``` ## Supported Datasets Amphion supports preprocessing for these TTS datasets: - **LibriTTS**: Large-scale multi-speaker English speech - **LJSpeech**: Single-speaker English speech - **VCTK**: Multi-speaker English speech - **OpenSinger**: Chinese singing voice - **M4Singer**: Chinese multi-speaker singing - **Emilia**: Multilingual in-the-wild speech (101k+ hours) ## Voice Characteristics Control Different TTS models offer various levels of control: ### Duration Control (FastSpeech2, VITS) ```python # Speed up or slow down speech output = model.inference( text="Hello world", duration_scale=0.8 # 20% faster ) ``` ### Pitch Control ```python # Modify fundamental frequency output = model.inference( text="Hello world", pitch_scale=1.2 # Higher pitch ) ``` ### Energy Control ```python # Adjust speaking energy/intensity output = model.inference( text="Hello world", energy_scale=0.9 ) ``` ## Vocoder Integration Most TTS models require a vocoder to convert acoustic features to waveform: ```bash # Train with HiFi-GAN vocoder python bins/train.py \ --config config/tts/VITS/vits_hifigan.yaml ``` Available 
## Pre-trained Models

Access pre-trained models from:

- **HuggingFace**: https://huggingface.co/amphion (MaskGCT, Vevo, and others)
- **ModelScope**: https://modelscope.cn/organization/amphion (MaskGCT, Metis, and others)
- **Local**: Provided in the `pretrained/` directory

### Using Pre-trained Models

```python
from amphion.models import build_model
from amphion.utils import load_config

# Load pre-trained VALL-E (build the model from its config first)
config = load_config('config/tts/VALLE/valle.yaml')  # adjust to your config path
model = build_model(config)
model.load_pretrained('amphion/vall-e')

# Inference
output = model.inference("Your text here")
```

## TTS Demo Samples

Listen to TTS samples from Amphion models: https://openhlt.github.io/Amphion_TTS_Demo/

## Performance Metrics

TTS quality is evaluated using:

- **MOS (Mean Opinion Score)**: Subjective speech quality (scale 1-5)
- **PESQ (Perceptual Evaluation of Speech Quality)**: Objective speech quality
- **FAD (Frechet Audio Distance)**: Distribution distance metric
- **WER (Word Error Rate)**: Via ASR (Whisper)
- **Speaker Similarity**: Via speaker verification models

## Troubleshooting

### Out-of-Memory Errors

```yaml
# Reduce batch size
train:
  batch_size: 8  # Decrease from default

  # Enable gradient accumulation
  gradient_accumulation_steps: 2

# Enable gradient checkpointing
model:
  use_checkpoint: true
```

### Poor Voice Quality

- Ensure high-quality training data
- Increase training duration
- Adjust the learning rate schedule
- Try a different vocoder

### Alignment Issues (for models needing alignment)

- Use Montreal Forced Aligner (MFA) for better alignment
- Adjust the forced alignment configuration
- Check data quality

## Advanced Topics

### Fine-tuning Pre-trained Models

```bash
python bins/train.py \
    --config config/tts/VITS/vits.yaml \
    --exp-name fine_tune \
    --pretrained-model-name amphion/vits-libritts \
    --resume
```

### Knowledge Distillation

Train a student model from a teacher:

```yaml
distillation:
  enabled: true
  teacher_model: vits
  temperature: 5.0
  alpha: 0.5
```

### Data Augmentation

```yaml
data_augmentation:
  speed_perturb: [0.95, 1.05]
  pitch_shift: [-2, 2]
  energy_scale: [0.9, 1.1]
```
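The augmentation ranges above correspond to standard waveform-level transforms. The standalone sketch below only illustrates what each setting means, using `librosa`; how Amphion applies them inside its own data pipeline is not shown here:

```python
import random

import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=None)

speed = random.uniform(0.95, 1.05)   # speed_perturb: [0.95, 1.05]
n_steps = random.uniform(-2, 2)      # pitch_shift: [-2, 2] semitones
gain = random.uniform(0.9, 1.1)      # energy_scale: [0.9, 1.1]

# Apply speed perturbation, pitch shift, and energy scaling in sequence
y_aug = librosa.effects.time_stretch(y, rate=speed)
y_aug = librosa.effects.pitch_shift(y_aug, sr=sr, n_steps=n_steps)
y_aug = np.clip(y_aug * gain, -1.0, 1.0)

sf.write("sample_augmented.wav", y_aug, sr)
```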
## Resources

- **Official Docs**: https://amphion.dev
- **GitHub Repo**: https://github.com/open-mmlab/Amphion
- **Paper**: https://arxiv.org/abs/2312.09911
- **Community**: https://discord.com/invite/drhW7ajqAG

---

# Voice Conversion (VC) in Amphion

## Overview

Amphion's Voice Conversion module enables zero-shot and few-shot voice conversion with fine-grained control over speaker characteristics. It supports multiple advanced models designed for quality, naturalness, and flexibility.

## Voice Conversion Capabilities

Voice Conversion in Amphion can handle:

- **Voice Conversion (VC)**: Convert speaker identity while preserving content
- **Accent Conversion (AC)**: Change accent while maintaining speaker characteristics
- **Timbre Conversion**: Adjust voice timbre and color
- **Style Conversion**: Modify speaking/singing style

## Supported Voice Conversion Models

### 1. Vevo (VersatileVoice)

**Architecture**: Zero-shot voice imitation framework with controllable timbre and style

**Released**: December 2024

**Key Features**:
- **Zero-shot capabilities**: Convert any voice without fine-tuning
- **Controllable generation**: Independent control of timbre and style
- **Dual-branch design**:
  - **Vevo-Timbre**: Style-preserved voice conversion
  - **Vevo-Voice**: Style-converted voice conversion
- **Multi-task capability**:
  - Voice Conversion (VC)
  - Text-to-Speech (TTS)
  - Accent Conversion (AC)
  - Speech Enhancement

**Model Details**:
- Autoregressive Transformer + Flow-Matching Transformer
- Trained on the Emilia dataset (101k+ hours)
- State-of-the-art zero-shot VC performance
- Pre-trained models available on HuggingFace

**Paper**: https://openreview.net/pdf?id=anQDiQZhDP

**Configuration Location**: `models/vc/vevo/`

#### Vevo Usage Example

```python
from amphion.models import build_model
from amphion.utils import load_config

# Load pre-trained Vevo model
config = load_config('config/vc/vevo/vevo.yaml')
model = build_model(config)
model.load_pretrained('amphion/vevo')

# Voice conversion with style preservation
output = model.inference(
    source_audio='input.wav',
    target_speaker_audio='reference.wav',
    mode='timbre'  # Preserve style
)

# Voice conversion with style transfer
output = model.inference(
    source_audio='input.wav',
    target_speaker_audio='reference.wav',
    mode='voice'  # Convert both timbre and style
)
```

#### Vevo1.5 (April 2025)

Enhanced version extending Vevo with:

- Unified speech and singing voice generation
- More robust generation
- Extended zero-shot capabilities
- Better accent conversion

**Blog**: https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a

### 2. FACodec (Factorized Codec)

**Architecture**: Neural audio codec with decomposition

**Key Features**:
- Decomposes speech into subspaces:
  - **Content**: Linguistic information
  - **Prosody**: Pitch and duration patterns
  - **Timbre**: Speaker-specific characteristics
- Zero-shot voice conversion
- Flexible audio manipulation
- Continuous representation

**Paper**: https://arxiv.org/abs/2403.03100

**Available Models**:
- NaturalSpeech3 FACodec
- Pre-trained checkpoint on HuggingFace

**Usage** (a timbre-swap sketch building on this API follows the model list below):

```python
from amphion.models import build_model

# Load FACodec (build the model from its config first, as in the other examples)
model = build_model(config)
model.load_pretrained('amphion/naturalspeech3_facodec')

# Decompose speech
content, prosody, timbre = model.decompose(audio)

# Reconstruct with a different timbre (e.g., one decomposed from a reference utterance)
output = model.reconstruct(content, prosody, target_timbre)
```

### 3. Noro (Noise-Robust Voice Conversion)

**Architecture**: Zero-shot voice conversion for noisy conditions

**Released**: 2024

**Key Features**:
- **Noise robustness**: Handles noisy reference speech
- **Dual-branch reference encoding**:
  - Speech branch: Captures voice characteristics
  - Noise branch: Suppresses noise information
- **Contrastive learning**: Noise-agnostic speaker loss
- Zero-shot capability
- Robust to various noise types

**Paper**: https://arxiv.org/abs/2411.19770

**Best For**: Real-world voice conversion with background noise

**Configuration Location**: `egs/vc/Noro/`
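As a usage example of the FACodec decomposition above, the following sketch keeps the source utterance's content and prosody but swaps in the timbre decomposed from a reference utterance. It reuses the `decompose`/`reconstruct` calls from the FACodec usage example; the config path is illustrative:

```python
import librosa
import torch

from amphion.models import build_model
from amphion.utils import load_config

config = load_config("config/codec/facodec.yaml")  # illustrative path
model = build_model(config)
model.load_pretrained("amphion/naturalspeech3_facodec")
model.eval()

source, sr = librosa.load("source.wav", sr=16000)
reference, _ = librosa.load("reference.wav", sr=16000)

with torch.no_grad():
    content, prosody, _ = model.decompose(source)      # keep source content and prosody
    _, _, target_timbre = model.decompose(reference)   # take the reference timbre
    converted = model.reconstruct(content, prosody, target_timbre)
```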
## Metis Foundation Model (February 2025)

**Purpose**: Unified speech generation foundation model

**Capabilities**:
- Zero-shot text-to-speech
- Voice conversion
- Target speaker extraction
- Speech enhancement
- Lip-to-speech

**Pre-trained Models**: Available on HuggingFace

**Paper**: https://arxiv.org/pdf/2502.03128

## VC Workflow

### 1. Voice Conversion Inference

Using Vevo for zero-shot VC:

```bash
# Command line inference
python bins/inference.py \
    --config config/vc/vevo/vevo.yaml \
    --checkpoint pretrained/vevo/vevo.pt \
    --input-audio source.wav \
    --reference-audio target_speaker.wav \
    --output-path output.wav
```

### 2. Python API

```python
import librosa
import soundfile as sf
import torch

from amphion.models import build_model
from amphion.utils import load_config

# Load model configuration
config = load_config('config/vc/vevo/vevo.yaml')
model = build_model(config)

# Load pre-trained weights
model.load_pretrained('amphion/vevo')
model.eval()

# Load audio files
source_audio, sr = librosa.load('source.wav', sr=16000)
reference_audio, _ = librosa.load('reference.wav', sr=16000)

# Perform voice conversion
with torch.no_grad():
    output = model.inference(
        source_audio=source_audio,
        target_speaker_audio=reference_audio,
        mode='timbre',  # or 'voice'
        pitch_scale=1.0,
        energy_scale=1.0
    )

# Save output
sf.write('output.wav', output.cpu().numpy(), sr)
```
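A common variant of the workflow above is converting a whole folder of utterances to the same reference speaker. A minimal sketch that reuses the `model` and inference arguments from the Python API example; directory names are illustrative:

```python
from pathlib import Path

import librosa
import soundfile as sf
import torch

reference_audio, sr = librosa.load("reference.wav", sr=16000)

out_dir = Path("converted")
out_dir.mkdir(exist_ok=True)

for wav_path in sorted(Path("source_wavs").glob("*.wav")):
    source_audio, _ = librosa.load(wav_path, sr=16000)
    with torch.no_grad():
        output = model.inference(
            source_audio=source_audio,
            target_speaker_audio=reference_audio,
            mode="timbre",
        )
    # Keep the original file name in the output directory
    sf.write(out_dir / wav_path.name, output.cpu().numpy(), sr)
```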
## VC Applications

### 1. Voice Cloning

Clone a speaker's voice for new content:

```python
# Reference audio from target speaker
reference_audio, sr = librosa.load('speaker_voice.wav')

# Source speech to convert (generated via TTS first, or existing speech)
source_speech, _ = librosa.load('source_speech.wav')

# Convert
output = model.voice_conversion(
    source_speech,
    reference_audio,
    mode='voice'
)
```

### 2. Accent Conversion

Modify accent while preserving speaker identity:

```python
# Reference audio with target accent
reference_audio, sr = librosa.load('target_accent.wav')

# Apply accent conversion
output = model.accent_conversion(
    source_speech,
    reference_audio
)
```

### 3. Timbre Adjustment

Modify voice characteristics:

```python
# Reference audio with desired timbre
reference_audio, sr = librosa.load('reference.wav')

# Apply timbre modification
output = model.timbre_conversion(
    source_speech,
    reference_audio,
    preservation_strength=0.7  # Balance between preservation and conversion
)
```

### 4. Real-World Applications

Use Noro for robust VC with noisy reference audio:

```python
# Handle noisy reference audio
noisy_reference_audio, sr = librosa.load('noisy_reference.wav')

output = model.robust_voice_conversion(
    source_speech,
    noisy_reference_audio,
    noise_robustness=True
)
```

## Configuration Structure

```yaml
# Model architecture
model:
  type: Vevo  # or FACodec, Noro, Metis
  hidden_size: 256
  num_layers: 12

# Encoder configuration
encoder:
  type: transformer
  num_heads: 8

# Decoder configuration
decoder:
  type: transformer
  num_heads: 8

# Vocoder
vocoder:
  type: hifigan
  checkpoint: pretrained/vocoders/hifigan.pt

# Inference settings
inference:
  mode: timbre  # or voice
  pitch_scale: 1.0
  energy_scale: 1.0
  duration_scale: 1.0
```

## Audio Quality Control

Control output characteristics:

```python
output = model.inference(
    source_audio=source,
    target_speaker_audio=reference,

    # Voice quality parameters
    pitch_scale=1.0,     # Adjust pitch (0.5-2.0)
    energy_scale=1.0,    # Adjust loudness (0.5-2.0)
    duration_scale=1.0,  # Adjust speaking rate (0.5-2.0)

    # Conversion intensity
    conversion_strength=1.0,  # 0.0 = no change, 1.0 = full conversion
)
```

## Pre-trained Models

### Vevo
- HuggingFace: https://huggingface.co/amphion/Vevo
- ModelScope: https://modelscope.cn/models/amphion/Vevo
- All pre-trained on the Emilia dataset

### FACodec
- HuggingFace: https://huggingface.co/amphion/naturalspeech3_facodec
- Pre-trained model checkpoint included

### Noro
- Available in the repository
- Trained on multiple voice conversion datasets

### Metis
- HuggingFace: https://huggingface.co/amphion/metis
- Foundation model for unified speech generation

## Supported Datasets for Training

- **VCTK**: Multi-speaker English speech
- **TIMIT**: Phonetically balanced speech
- **VoxCeleb**: Speaker recognition dataset
- **Emilia**: Large-scale multilingual in-the-wild data
- **Custom datasets**: With proper preprocessing

## Performance Metrics

Evaluate voice conversion using:

- **MCD (Mel-Cepstral Distortion)**: Spectral similarity
- **FAD (Frechet Audio Distance)**: Perceptual quality
- **Speaker Similarity**: Via speaker verification models
  - RawNet3
  - WeSpeaker
  - WavLM
- **Content Preservation**: Via ASR (Whisper)
- **PESQ**: Voice quality metric

## Comparison with Baselines

| Model   | Zero-Shot | Robustness | Speed  | Quality   |
|---------|-----------|------------|--------|-----------|
| Vevo    | Yes       | Medium     | Fast   | High      |
| Vevo1.5 | Yes       | High       | Fast   | Very High |
| FACodec | Yes       | Medium     | Fast   | High      |
| Noro    | Yes       | Very High  | Medium | High      |
| Metis   | Yes       | High       | Medium | Very High |

## Advanced Features

### Multi-Reference Voice Cloning

Use multiple reference speakers:

```python
# Multiple references with mixing weights
references = [
    ('speaker1.wav', 0.3),
    ('speaker2.wav', 0.5),
    ('speaker3.wav', 0.2),
]

output = model.multi_reference_conversion(
    source_audio,
    references=references
)
```

### Fine-tuning for Custom Voices

```bash
python bins/train.py \
    --config config/vc/vevo/vevo_finetune.yaml \
    --pretrained-model-name amphion/vevo \
    --custom-speaker-data path/to/speaker/data
```

### Streaming/Online Voice Conversion

For real-time applications:

```python
model.set_inference_mode('streaming')

output = model.streaming_inference(
    audio_stream,      # Streaming audio input
    reference_audio,
    chunk_length=8000  # Process in chunks
)
```
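A minimal driver for the streaming mode shown above, simulating a live stream by slicing a file into fixed-size chunks; the `streaming_inference` call and its arguments follow the example above, and a real application would feed microphone buffers instead:

```python
import librosa
import numpy as np

reference_audio, sr = librosa.load("reference.wav", sr=16000)
audio, _ = librosa.load("long_input.wav", sr=16000)

model.set_inference_mode("streaming")

chunk_length = 8000  # 0.5 s at 16 kHz
outputs = []
for start in range(0, len(audio), chunk_length):
    # Feed one chunk at a time, as a live source would
    chunk = audio[start:start + chunk_length]
    out = model.streaming_inference(
        chunk,
        reference_audio,
        chunk_length=chunk_length,
    )
    outputs.append(np.asarray(out))

converted = np.concatenate(outputs)
```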
## Troubleshooting

### Voice Quality Issues

1. **Artifacts**: Use higher-quality reference audio
2. **Unnatural pitch**: Adjust the `pitch_scale` parameter
3. **Poor timbre**: Try different reference speakers
4. **Noisy output**: Increase reference audio quality or use Noro

### Inference Speed

- Use GPU acceleration
- Reduce audio length
- Use VQ-based models for faster inference

### Memory Issues

```python
# Enable gradient checkpointing if training
model.enable_gradient_checkpointing()

# Reduce batch size for inference
model.set_batch_size(1)
```

## Resources

- **GitHub VC Module**: https://github.com/open-mmlab/Amphion/tree/main/models/vc
- **Vevo Paper**: https://openreview.net/pdf?id=anQDiQZhDP
- **FACodec Paper**: https://arxiv.org/abs/2403.03100
- **Noro Paper**: https://arxiv.org/abs/2411.19770
- **Metis Paper**: https://arxiv.org/pdf/2502.03128
- **Demo**: https://versavoice.github.io/
- **Community**: https://discord.com/invite/drhW7ajqAG