# Amphion > Amphion provides comprehensive objective evaluation capabilities and multiple state-of-the-art neural vocoders for audio synthesis tasks. --- # Evaluation Metrics and Vocoders in Amphion ## Overview Amphion provides comprehensive objective evaluation capabilities and multiple state-of-the-art neural vocoders for audio synthesis tasks. ## Evaluation Metrics Amphion implements a complete set of evaluation metrics for assessing audio generation quality across multiple dimensions: ### F0 Modeling Metrics Evaluate pitch/fundamental frequency accuracy: - **F0 Pearson Coefficient**: Correlation between predicted and ground truth F0 - Range: -1.0 to 1.0 - Higher is better - Measures pitch contour tracking - **F0 Periodicity RMSE**: Root Mean Square Error for voiced/unvoiced detection - Measures periodicity accuracy - Lower is better - **F0 RMSE**: Root Mean Square Error for pitch value prediction - Measures absolute pitch accuracy - Lower is better - **Voiced/Unvoiced F1 Score**: Binary classification accuracy - Measures ability to detect voiced vs unvoiced segments - Range: 0-1, higher is better ```python from amphion.evaluation import F0Metrics f0_metrics = F0Metrics() # Compute metrics pearson_coef = f0_metrics.pearson_coefficient(pred_f0, gt_f0) voicing_f1 = f0_metrics.voicing_f1(pred_voicing, gt_voicing) rmse = f0_metrics.f0_rmse(pred_f0, gt_f0) ``` ### Energy Modeling Metrics Evaluate energy/amplitude accuracy: - **Energy RMSE**: Root Mean Square Error for energy prediction - Lower is better - Measures amplitude accuracy - **Energy Pearson Coefficient**: Correlation with ground truth energy - Range: -1.0 to 1.0 - Higher is better ```python from amphion.evaluation import EnergyMetrics energy_metrics = EnergyMetrics() rmse = energy_metrics.energy_rmse(pred_energy, gt_energy) pearson = energy_metrics.pearson_coefficient(pred_energy, gt_energy) ``` ### Intelligibility Metrics Measure content preservation and clarity: - **Character Error Rate (CER)**: Character-level WER - Requires ASR model (Whisper) - Lower is better (0% = perfect) - **Word Error Rate (WER)**: Word-level error rate - Requires ASR model (Whisper) - Lower is better (0% = perfect) ```python from amphion.evaluation import IntelligibilityMetrics from amphion.models import WhisperExtractor intelligibility = IntelligibilityMetrics( asr_model=WhisperExtractor('base') ) wer = intelligibility.word_error_rate(audio, reference_text) cer = intelligibility.character_error_rate(audio, reference_text) ``` ### Spectrogram Distortion Metrics Measure audio quality and similarity: #### Frechet Audio Distance (FAD) - **Purpose**: Perceptual audio quality metric - **Range**: 0 to infinity (lower is better) - **Based on**: VGGish audio feature embeddings - **Interpretation**: Distance between distributions ```python from amphion.evaluation import FADMetrics fad_metrics = FADMetrics() # Compute FAD fad = fad_metrics.compute(generated_audio, reference_audio) # Typical good value: < 3.0 ``` #### Mel-Cepstral Distortion (MCD) - **Purpose**: Spectral similarity measure - **Range**: 0 to infinity (lower is better) - **Best For**: Voice conversion, TTS - **Unit**: dB ```python from amphion.evaluation import MCDMetrics mcd_metrics = MCDMetrics() # Compute MCD mcd = mcd_metrics.compute(predicted, reference) # Typical good value: < 5.0 dB ``` #### Multi-Resolution STFT Distance (MSTFT) - **Purpose**: Multi-scale spectral comparison - **Uses**: Multiple window sizes and FFT lengths - **Range**: 0 to infinity (lower is better) ```python from 
amphion.evaluation import MSTFTMetrics mstft_metrics = MSTFTMetrics() mag_loss, phase_loss = mstft_metrics.compute(predicted, reference) ``` #### PESQ (Perceptual Evaluation of Speech Quality) - **Purpose**: Subjective speech quality prediction - **Range**: -0.5 to 4.5 (higher is better) - **Best For**: Speech synthesis quality - **Correlation**: High correlation with MOS ```python from amphion.evaluation import PESQMetrics pesq_metrics = PESQMetrics() score = pesq_metrics.compute(reference, generated) # Typical good value: > 3.0 ``` #### STOI (Short Time Objective Intelligibility) - **Purpose**: Speech intelligibility metric - **Range**: 0 to 1 (higher is better) - **Based on**: SNR estimates in bark bands ```python from amphion.evaluation import STOIMetrics stoi_metrics = STOIMetrics() score = stoi_metrics.compute(reference, generated) # Typical good value: > 0.8 ``` ### Speaker Similarity Metrics Measure speaker identity preservation: Supported speaker verification models: - **RawNet3**: End-to-end speaker recognition - **Resemblyzer**: Simple speaker embedding - **WeSpeaker**: WeNet speaker embedding - **WavLM**: Large multilingual model ```python from amphion.evaluation import SpeakerSimilarityMetrics from amphion.models import RawNet3 speaker_metrics = SpeakerSimilarityMetrics( extractor=RawNet3() ) # Cosine similarity similarity = speaker_metrics.cosine_similarity(audio1, audio2) # Range: -1 to 1 (higher = more similar speaker) ``` ## Evaluation Workflow ### Complete Evaluation Script ```python from amphion.evaluation import ( FADMetrics, MCDMetrics, PESQMetrics, SpeakerSimilarityMetrics, IntelligibilityMetrics ) import numpy as np # Initialize metrics fad = FADMetrics() mcd = MCDMetrics() pesq = PESQMetrics() speaker_sim = SpeakerSimilarityMetrics() intelligibility = IntelligibilityMetrics() # Load generated and reference audio generated_audio = load_audio('generated.wav') reference_audio = load_audio('reference.wav') # Compute all metrics results = { 'fad': fad.compute(generated_audio, reference_audio), 'mcd': mcd.compute(generated_audio, reference_audio), 'pesq': pesq.compute(reference_audio, generated_audio), 'speaker_sim': speaker_sim.cosine_similarity( generated_audio, reference_audio ), } # Print results for metric, value in results.items(): print(f"{metric}: {value:.4f}") ``` ### Batch Evaluation ```bash python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/vits.pt \ --test-dir path/to/test/data \ --output-csv results.csv ``` Configuration for batch evaluation: ```yaml evaluation: metrics: - fad - mcd - pesq - speaker_similarity - intelligibility fad: model: vggish pesq: sample_rate: 16000 speaker_similarity: extractor: rawnet3 intelligibility: asr_model: whisper-base ``` ## Neural Vocoders Vocoders convert acoustic features (mel-spectrograms) to waveforms. Amphion supports multiple vocoder architectures. 
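To make the mel-to-waveform interface concrete before the per-model details below, here is a minimal sketch that computes a log-mel spectrogram with librosa and hands it to a vocoder. The `build_vocoder` call mirrors the hypothetical API used in the Vocoder Evaluation example later on this page and is an assumption, not a confirmed Amphion interface.

```python
import librosa
import numpy as np
import torch

# Compute an 80-bin log-mel spectrogram, a typical neural vocoder input.
wav, sr = librosa.load("reference.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = torch.from_numpy(np.log(np.clip(mel, 1e-5, None))).unsqueeze(0)  # shape: (1, 80, frames)

# Hypothetical vocoder call, mirroring the build_vocoder usage shown further below.
# vocoder = build_vocoder("hifigan")
# with torch.no_grad():
#     waveform = vocoder(log_mel)  # shape: (1, num_samples)
```

Whichever vocoder is chosen, the mel parameters (FFT size, hop length, number of bins, sample rate) must match those used when the vocoder checkpoint was trained; mismatched settings are a common cause of metallic or time-stretched output.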
### GAN-Based Vocoders #### HiFi-GAN **Paper**: https://arxiv.org/abs/2010.05646 **Key Features**: - High-quality audio generation - Fast inference - Lightweight architecture - Multi-scale discriminators **Configuration**: ```yaml vocoder: type: hifigan pretrained: true checkpoint: pretrained/vocoders/hifigan.pt ``` **Performance**: - MOS: ~3.8-4.0 - Inference speed: Real-time (>10x) - Model size: ~3.6M parameters #### NSF-HiFiGAN **Enhancement**: NSF (Neural Source Filter) + HiFi-GAN **Key Features**: - Improved pitch accuracy - Better F0 modeling - Faster convergence ```yaml vocoder: type: nsf_hifigan f0_quantizer: linear ``` #### BigVGAN **Paper**: https://arxiv.org/abs/2206.04658 **Key Features**: - Larger capacity model - Superior audio quality - Better generalization - Improved high-frequency content ```yaml vocoder: type: bigvgan pretrained: true ``` #### APNet **Paper**: https://arxiv.org/abs/2305.07952 **Key Features**: - Adaptive parallel architecture - Efficient design - High-quality output ```yaml vocoder: type: apnet ``` #### MelGAN **Lightweight option for fast inference** **Key Features**: - Small model size - Fast inference - Mobile-friendly ```yaml vocoder: type: melgan ``` ### Flow-Based Vocoders #### WaveGlow **Paper**: https://arxiv.org/abs/1811.00002 **Key Features**: - Normalizing flow model - Parallel generation - Invertible transformation **Configuration**: ```yaml vocoder: type: waveglow n_flows: 12 n_group: 8 ``` ### Diffusion-Based Vocoders #### Diffwave **Paper**: https://arxiv.org/abs/2009.09761 **Key Features**: - Diffusion-based generation - High-quality audio - Slower inference **Configuration**: ```yaml vocoder: type: diffwave num_steps: 50 sampler: ddim ``` ### Auto-Regressive Vocoders #### WaveNet **Paper**: https://arxiv.org/abs/1609.03499 **Key Features**: - Dilated convolutions - Causal generation - High quality but slow #### WaveRNN **Paper**: https://arxiv.org/abs/1802.08435 **Key Features**: - Efficient RNN-based - Faster than WaveNet - Still slower than GAN-based ## Vocoder Training ### Training a Custom Vocoder ```bash python bins/train.py \ --config config/vocoder/hifigan/train.yaml \ --exp-name my_vocoder ``` Configuration: ```yaml model: type: HiFiGAN generator: channels: 512 upsample_scales: [8, 8, 2, 2] upsample_kernel_sizes: [16, 16, 4, 4] discriminator: scales: 3 periods: [2, 3, 5, 7, 11] data: dataset: libritts sample_rate: 16000 batch_size: 32 train: learning_rate_g: 0.0002 learning_rate_d: 0.0002 betas: [0.5, 0.9] max_epochs: 100 ``` ### Vocoder Evaluation ```python from amphion.models import build_vocoder from amphion.evaluation import PESQMetrics # Load vocoder vocoder = build_vocoder('hifigan') # Convert mel-spectrogram to audio mel_spec = load_mel_spectrogram('test.pt') audio = vocoder(mel_spec) # Evaluate pesq_metrics = PESQMetrics() score = pesq_metrics.compute(reference_audio, audio) ``` ## Vocoder Selection Guide | Vocoder | Quality | Speed | Size | Best For | |---------|---------|-------|------|----------| | HiFi-GAN | High | Very Fast | Small | General purpose | | NSF-HiFiGAN | High | Very Fast | Small | Pitch-critical tasks | | BigVGAN | Very High | Fast | Medium | High-quality output | | APNet | High | Very Fast | Small | Efficient systems | | MelGAN | Medium | Very Fast | Tiny | Mobile/edge | | WaveGlow | High | Medium | Large | Parallel generation | | Diffwave | Very High | Slow | Medium | Offline generation | | WaveNet | Very High | Slow | Large | Research | ## Advanced Evaluation ### Custom Metrics Implement custom 
metrics: ```python from amphion.evaluation import AudioMetric class CustomMetric(AudioMetric): def __init__(self): super().__init__() def compute(self, predicted, reference): # Your metric implementation return metric_value # Use custom metric custom = CustomMetric() value = custom.compute(generated_audio, reference_audio) ``` ### Listening Tests Integration Amphion can organize audio samples for listening tests: ```python from amphion.evaluation import ListeningTestOrganizer organizer = ListeningTestOrganizer( models=['model1', 'model2', 'model3'], reference_audios=['ref1.wav', 'ref2.wav'], output_dir='listening_test' ) # Generates HTML interface for MOS collection organizer.generate_mos_interface() ``` ## Resources - **Evaluation Code**: https://github.com/open-mmlab/Amphion/tree/main/egs/metrics/ - **Paper**: https://arxiv.org/abs/2312.09911 - **Multi-Scale CQT Discriminator**: https://arxiv.org/abs/2311.14957 - **Community**: https://discord.com/invite/drhW7ajqAG --- # Amphion: Audio, Music, and Speech Generation Toolkit **Source:** https://github.com/open-mmlab/Amphion ## Overview Amphion (/æmˈfaɪən/) is an open-source deep learning toolkit for audio, music, and speech generation research and development. It is designed to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation. The toolkit offers unique visualizations of classic models and architectures, providing invaluable educational resources for understanding neural audio processing. ## Purpose The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. It is designed to support multiple individual generation tasks with a unified framework and pipeline. ## Supported Tasks Amphion provides comprehensive support for the following audio generation tasks: - **TTS (Text-to-Speech)** - Supported - Convert text to natural-sounding speech - Multiple supported architectures with state-of-the-art performance - **SVC (Singing Voice Conversion)** - Supported - Convert singing voice from one speaker/style to another - Multiple acoustic decoder implementations - **VC (Voice Conversion)** - Supported - Zero-shot and few-shot voice conversion - Controllable timbre and style conversion - **AC (Accent Conversion)** - Supported - Convert accents in speech while preserving content - Zero-shot capability for style conversion - **TTA (Text-to-Audio)** - Supported - Generate audio from textual descriptions - Latent diffusion model architecture - **SVS (Singing Voice Synthesis)** - In Development - Convert text directly to singing voice - **TTM (Text-to-Music)** - In Development - Generate music from textual descriptions ## Key Features ### TTS: Text-to-Speech Amphion achieves state-of-the-art performance on TTS systems with multiple supported architectures: - **FastSpeech2**: Non-autoregressive architecture using feed-forward Transformer blocks - **VITS**: End-to-end architecture with conditional VAE and adversarial learning - **VALL-E**: Zero-shot TTS using neural codec language model with discrete codes - **NaturalSpeech2**: Architecture using latent diffusion models for natural-sounding voices - **Jets**: End-to-end model jointly training FastSpeech2 and HiFi-GAN with alignment - **MaskGCT**: Fully non-autoregressive architecture eliminating explicit alignment requirements - **Vevo-TTS**: Zero-shot TTS with controllable timbre and style ### Voice Conversion & Imitation - **Vevo**: Zero-shot voice imitation framework 
with controllable timbre and style - Vevo-Timbre: Style-preserved voice conversion - Vevo-Voice: Style-converted voice conversion - **FACodec**: Decomposes speech into subspaces for content, prosody, and timbre - Achieves zero-shot voice conversion - **Noro**: Noise-robust zero-shot voice conversion system - Handles noisy reference speeches - Dual-branch reference encoding ### Singing Voice Conversion Amphion implements multiple speaker-agnostic feature representations: - **Content Features**: From WeNet, Whisper, and ContentVec pretrained models - **Prosody Features**: F0 and energy extraction - **Acoustic Decoders**: - Diffusion-based: DiffWaveNetSVC, DiffComoSVC (Consistency Model) - Transformer-based: TransformerSVC (encoder-only, non-autoregressive) - VAE/Flow-based: VitsSVC (VITS-like architecture) ### Text-to-Audio Generation - Latent diffusion model architecture - Two-stage training: VAE (AutoencoderKL) and conditional diffusion (AudioLDM) - Similar to AudioLDM, Make-an-Audio, and AUDIT frameworks ### Neural Audio Codecs - **DualCodec**: Low-frame-rate (12.5Hz or 25Hz) codec with SSL features - **FACodec**: Speech decomposition for content, prosody, and timbre ### Vocoders Amphion supports multiple neural vocoder architectures: - **GAN-based**: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet - **Flow-based**: WaveGlow - **Diffusion-based**: Diffwave - **Auto-regressive**: WaveNet, WaveRNN - **Multi-Scale Constant-Q Transform Discriminator**: Enhancement for GAN vocoders (ICASSP 2024) ### Evaluation Metrics Comprehensive objective evaluation capabilities: - **F0 Modeling**: F0 Pearson Coefficients, Periodicity RMSE, Voiced/Unvoiced F1 Score - **Energy Modeling**: Energy RMSE, Energy Pearson Coefficients - **Intelligibility**: Character/Word Error Rate (via Whisper) - **Spectrogram Distortion**: FAD, MCD, Multi-Resolution STFT Distance, PESQ, STOI - **Speaker Similarity**: Cosine similarity (RawNet3, Resemblyzer, WeSpeaker, WavLM) ### Datasets Amphion provides unified data preprocessing for open-source datasets: - AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK - **Emilia Dataset**: Exclusive support for in-the-wild speech data - 101k+ hours of multilingual speech data - Latest Emilia-Large: 200,000+ hours (Emilia + Emilia-YODAS) - **Emilia-Pipe**: Preprocessing pipeline for in-the-wild speech data ### Visualization Tools - **SingVisio**: Interactive visualization tool for diffusion models in singing voice conversion - Educational resource for understanding model internals - Facilitates understandable research ## Latest Releases ### Amphion v0.2 (January 2025) - Comprehensive technical report covering 2024 updates - Emilia-Large dataset (200k+ hours) - Enhanced multilingual support - Multiple new model releases ### Recent Model Releases - **DualCodec** (May 2025): Low-frame-rate neural audio codec - **Vevo1.5** (April 2025): Unified speech and singing voice generation - **Metis** (February 2025): Foundation model for unified speech generation - **MaskGCT** (October 2024): State-of-the-art non-autoregressive TTS - **Vevo** (December 2024): Zero-shot voice imitation framework ## Pre-trained Models Amphion provides pre-trained models available on: - HuggingFace: https://huggingface.co/amphion - ModelScope: https://modelscope.cn/organization/amphion - OpenXLab: https://openxlab.org.cn/usercenter/Amphion All models are released under the MIT License for both research and commercial use. 
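As a quick way to fetch any of these checkpoints locally, the sketch below uses the standard `huggingface_hub` client. The repo id is illustrative, not a guaranteed model name; browse https://huggingface.co/amphion for the current list, and load the downloaded weights through the model-specific recipes described elsewhere in this documentation.

```python
from huggingface_hub import snapshot_download

# Download a pretrained Amphion model repository from the Hugging Face Hub.
# "amphion/MaskGCT" is an example repo id; substitute the model you need.
local_dir = snapshot_download(repo_id="amphion/MaskGCT")
print(f"Checkpoint files downloaded to: {local_dir}")
```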
## Community & Resources - **GitHub**: https://github.com/open-mmlab/Amphion - **Discord**: Join the community at https://discord.com/invite/drhW7ajqAG - **Paper**: https://arxiv.org/abs/2312.09911 - **Website**: https://amphion.dev - **HuggingFace Demos**: Interactive demos available for multiple models ## License Amphion is released under the MIT License, allowing free use for both research and commercial applications. --- # Amphion Installation Guide ## Overview Amphion can be installed through two methods: 1. Setup Installer (Python environment) 2. Docker Image (containerized with GPU support) ## Method 1: Setup Installer ### Prerequisites - Git - Conda (Anaconda or Miniconda) - Python 3.9+ (recommended: 3.9.15) - CUDA toolkit (for GPU support) - cuDNN (for GPU support) ### Installation Steps #### Step 1: Clone the Repository ```bash git clone https://github.com/open-mmlab/Amphion.git cd Amphion ``` #### Step 2: Create Conda Environment ```bash conda create --name amphion python=3.9.15 conda activate amphion ``` #### Step 3: Install Dependencies Amphion provides an installation script that handles all Python package dependencies: ```bash sh env.sh ``` This script will install: - Core dependencies (PyTorch, torchaudio, librosa) - Model dependencies (diffusers, transformers, julius) - Audio processing (soundfile, scipy, matplotlib) - Data processing (numpy, pandas) - ML utilities (lightning, tensorboard, wandb) #### Step 4: Verify Installation To verify your installation is working: ```bash python -c "import amphion; print('Amphion installed successfully')" ``` ### Troubleshooting **CUDA/GPU Issues**: If you encounter CUDA errors, ensure you have: - Compatible NVIDIA drivers installed - CUDA toolkit matching your PyTorch installation - cuDNN properly configured **Memory Issues**: If you encounter out-of-memory errors: - Reduce batch size in configuration files - Use gradient accumulation - Enable gradient checkpointing ## Method 2: Docker Installation ### Prerequisites - Docker - NVIDIA Driver (latest version recommended) - NVIDIA Container Toolkit - CUDA toolkit (compatible with your NVIDIA driver) ### Installation Steps #### Step 1: Install Docker Dependencies If not already installed: ```bash # Install Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo sh get-docker.sh # Install NVIDIA Container Toolkit distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \ sudo tee /etc/apt/sources.list.d/nvidia-docker.list sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit sudo systemctl restart docker ``` #### Step 2: Clone Repository and Pull Docker Image ```bash git clone https://github.com/open-mmlab/Amphion.git cd Amphion docker pull realamphion/amphion ``` #### Step 3: Run Docker Container Run the Docker container with GPU support: ```bash docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion ``` #### Step 4: Mount Datasets To use your own datasets with Docker, mount them as volumes: ```bash docker run --runtime=nvidia --gpus all -it \ -v .:/app \ -v /path/to/datasets:/app/datasets \ realamphion/amphion ``` For detailed Docker volume mounting instructions, see: - [Mount Dataset in Docker Container](../egs/datasets/docker.md) - [Docker Documentation](https://docs.docker.com/engine/reference/commandline/container_run/#volume) ### Available Docker Images The official Docker image includes: - Pre-installed PyTorch with CUDA support - All Amphion dependencies and models - NVIDIA CUDA runtime - Ready-to-use development environment ## System Requirements ### Minimum Requirements - **CPU**: 4+ cores - **RAM**: 8GB (16GB+ recommended) - **GPU**: NVIDIA GPU with 2GB+ VRAM (for inference) - 8GB+ VRAM recommended for training - **Disk Space**: 20GB+ for models and datasets ### Recommended Configuration - **CPU**: 8+ cores - **RAM**: 32GB+ - **GPU**: NVIDIA GPU with 24GB+ VRAM (for training) - RTX 3090, RTX 4090, H100, or A100 recommended - **Storage**: SSD with 100GB+ free space ## Quick Start After Installation ### Python Usage After installation, use Amphion in your Python code: ```python from amphion.utils import load_config from amphion.models import build_model # Load configuration config = load_config('path/to/config.yaml') # Build and use model model = build_model(config) ``` ### Command Line Usage Access Amphion's CLI tools: ```bash # Activate environment conda activate amphion # Run preprocessing python bins/data/preprocess_dataset.py --config config/... # Train a model python bins/train.py --config config/... # Inference python bins/inference.py --config config/... ``` ### Docker Usage Inside Docker container: ```bash cd /app # Run preprocessing python bins/data/preprocess_dataset.py --config config/... # Train a model python bins/train.py --config config/... # Exit container exit ``` ## Configuration Files Amphion uses YAML configuration files for all tasks. Configuration templates are located in: ``` Amphion/ ├── config/ │ ├── tts/ # Text-to-Speech configs │ ├── svc/ # Singing Voice Conversion configs │ ├── vc/ # Voice Conversion configs │ ├── tta/ # Text-to-Audio configs │ └── vocoder/ # Vocoder configs ``` ## Environment Variables Optional environment variables for advanced configuration: ```bash # Set number of CPU threads export OMP_NUM_THREADS=8 # Set CUDA device export CUDA_VISIBLE_DEVICES=0,1 # Enable mixed precision export AMPHION_MIXED_PRECISION=fp16 ``` The `env.sh` script is provided to set up common environment variables: ```bash source env.sh ``` ## Next Steps After successful installation: 1. 
**Choose a Task**: - [Text-to-Speech (TTS)](../egs/tts/README.md) - [Singing Voice Conversion (SVC)](../egs/svc/README.md) - [Voice Conversion (VC)](../models/vc/vevo/README.md) - [Text-to-Audio (TTA)](../egs/tta/README.md) 2. **Download Datasets**: Check available preprocessed datasets in `egs/datasets/README.md` 3. **Run Examples**: Start with provided recipes and examples 4. **Join Community**: Participate in discussions on [Discord](https://discord.com/invite/drhW7ajqAG) ## Getting Help - **GitHub Issues**: https://github.com/open-mmlab/Amphion/issues - **Discord Community**: https://discord.com/invite/drhW7ajqAG - **Documentation**: https://amphion.dev - **Papers & Reports**: https://arxiv.org/search/?query=amphion --- # Amphion Quick Reference Guide ## Repository Structure ``` Amphion/ ├── bins/ # Command-line scripts │ ├── train.py # Training entrypoint │ ├── inference.py # Inference entrypoint │ └── metrics/ # Evaluation scripts ├── config/ # Configuration files (YAML) │ ├── tts/ # Text-to-Speech configs │ ├── svc/ # Singing Voice Conversion configs │ ├── vc/ # Voice Conversion configs │ ├── tta/ # Text-to-Audio configs │ └── vocoder/ # Vocoder configs ├── models/ # Model implementations │ ├── tts/ # TTS models │ ├── vc/ # VC models (Vevo, FACodec, etc.) │ ├── svc/ # SVC models │ ├── codec/ # Neural codecs │ └── vocoders/ # Vocoders ├── modules/ # Neural network modules ├── preprocessors/ # Data preprocessing │ └── Emilia/ # Emilia dataset preprocessing ├── evaluation/ # Evaluation metrics ├── egs/ # Example recipes │ ├── tts/ # TTS recipes │ ├── svc/ # SVC recipes │ ├── vc/ # VC recipes │ ├── tta/ # TTA recipes │ ├── datasets/ # Dataset instructions │ ├── metrics/ # Evaluation guides │ └── visualization/ # SingVisio visualization └── pretrained/ # Pre-trained model checkpoints ``` ## Essential Commands ### Installation ```bash # Clone repository git clone https://github.com/open-mmlab/Amphion.git cd Amphion # Setup Python environment conda create --name amphion python=3.9.15 conda activate amphion # Install dependencies sh env.sh # Docker alternative docker pull realamphion/amphion docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion ``` ### Data Preparation ```bash # Generic dataset preprocessing python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # Emilia dataset preprocessing python bins/data/preprocess_dataset.py \ --config config/preprocessors/Emilia/emilia_pipe.yaml \ --raw-data-dir /path/to/raw/audio ``` ### Training ```bash # Basic training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment # Resume training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment \ --resume # Distributed training (8 GPUs) python -m torch.distributed.launch \ --nproc_per_node=8 \ bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment # Mixed precision training python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_experiment \ --mixed_precision fp16 ``` ### Inference ```bash # TTS inference python bins/inference.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/my_model/ckpt.pt \ --text "Your text here" \ --output output.wav # Voice Conversion inference python bins/inference.py \ --config config/vc/vevo/vevo.yaml \ --checkpoint checkpoints/vevo.pt \ --source-audio source.wav \ --reference-audio reference.wav \ --output output.wav # SVC inference python bins/inference.py \ --config 
config/svc/DiffComoSVC/diffcomosvc.yaml \ --checkpoint checkpoints/svc.pt \ --source-audio source.wav \ --target-speaker speaker_id \ --output output.wav ``` ### Evaluation ```bash # Evaluate model python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/my_model/ckpt.pt \ --test-dir test_data/ \ --output metrics.json # Compute FAD score python bins/metrics/compute_fad.py \ --generated-dir generated_audio/ \ --reference-dir reference_audio/ # ASR evaluation (Word Error Rate) python bins/metrics/compute_asr.py \ --audio-dir generated_audio/ \ --reference-text reference_text.txt ``` ## Configuration Quick Reference ### Common Config Structure ```yaml # Model definition model: type: VITS # Model architecture hidden_size: 384 encoder_hidden_size: 384 num_mels: 80 # Data loading data: dataset: libritts data_dir: /path/to/data batch_size: 16 num_workers: 4 pin_memory: true # Training train: max_epochs: 100 learning_rate: 1e-3 optimizer: adam betas: [0.9, 0.999] weight_decay: 0.0 grad_clip: 5.0 grad_accumulation_steps: 1 # Validation valid: interval: 5000 num_samples: 10 # Checkpointing ckpt: keep_last: 3 keep_best_by_state_dict: true # Logging log: log_interval: 10 log_tensorboard: true ``` ### Task-Specific Configs #### TTS Configuration ```yaml model: type: VITS # or FastSpeech2, VALL-E, Jets # Speaker information (for multi-speaker) speaker: num_speakers: 100 embedding_dim: 256 # Vocoder vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt ``` #### SVC Configuration ```yaml # Content feature extractor acoustic_features: content_feature: type: whisper # or weinet, contentvec prosody: extract_f0: true extract_energy: true # Acoustic decoder acoustic_decoder: type: DiffComoSVC # Speaker info speaker: num_speakers: 100 embedding_dim: 256 ``` #### VC Configuration ```yaml model: type: Vevo # or FACodec, Noro # Inference settings inference: mode: timbre # or voice pitch_scale: 1.0 energy_scale: 1.0 ``` ## Quick Start Recipes ### TTS with VITS ```bash # 1. Prepare data python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # 2. Train python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name vits_libritts # 3. Infer python bins/inference.py \ --config config/tts/VITS/vits.yaml \ --checkpoint checkpoints/vits_libritts/ckpt.pt \ --text "Hello, this is a test." \ --output output.wav ``` ### SVC with DiffComoSVC ```bash # 1. Prepare dataset python bins/data/preprocess_dataset.py \ --config config/svc/prepare_svcc.yaml \ --datasets svcc # 2. Extract features python bins/data/extract_acoustic_features.py \ --config config/svc/extract_whisper_feature.yaml # 3. Train python bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --exp-name diffcomosvc_svcc # 4. 
Infer python bins/inference.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --checkpoint checkpoints/diffcomosvc_svcc/ckpt.pt \ --source-audio source.wav \ --target-speaker 1 \ --output output.wav ``` ### Voice Conversion with Pre-trained Model ```bash # Download and use pre-trained Vevo python -c " from amphion.models import build_model model = build_model(config) model.load_pretrained('amphion/vevo') " # Run inference python bins/inference.py \ --config config/vc/vevo/vevo.yaml \ --checkpoint pretrained/vevo/vevo.pt \ --source-audio source.wav \ --reference-audio reference.wav \ --output output.wav ``` ## Common File Locations | Item | Location | |------|----------| | Training scripts | `bins/train.py` | | Inference scripts | `bins/inference.py` | | Evaluation scripts | `bins/metrics/` | | Data preprocessing | `bins/data/` | | Model weights | `pretrained/` | | TTS configs | `config/tts/` | | SVC configs | `config/svc/` | | VC configs | `config/vc/` | | TTA configs | `config/tta/` | | Vocoder configs | `config/vocoder/` | | Model code | `models/` | | Dataset recipes | `egs/datasets/` | | Example configs | `egs//` | ## Environment Variables ```bash # Set CUDA devices export CUDA_VISIBLE_DEVICES=0,1,2,3 # Set number of CPU threads export OMP_NUM_THREADS=8 # Enable mixed precision export AMPHION_MIXED_PRECISION=fp16 # Set random seed for reproducibility export PYTHONHASHSEED=0 # Enable deterministic behavior export CUBLAS_WORKSPACE_CONFIG=:16:8 ``` ## Dataset Paths ### Pre-configured Datasets ```bash # Place datasets in these locations for automatic detection: ./data/libritts/ # LibriTTS dataset ./data/ljspeech/ # LJSpeech dataset ./data/vctk/ # VCTK dataset ./data/svcc/ # SVCC dataset ./data/opensinger/ # OpenSinger dataset ./data/emilia/ # Emilia dataset ``` ### Custom Dataset Format ``` custom_dataset/ ├── train/ │ ├── speaker_001/ │ │ ├── audio_001.wav │ │ ├── audio_002.wav │ │ └── transcription.txt │ └── speaker_002/ └── val/ └── speaker_001/ ``` ## Pre-trained Model Hub ### HuggingFace Models ```bash # Access pre-trained models from transformers import AutoModel # Text-to-Speech models amphion/maskgct # State-of-the-art TTS amphion/vall-e # Zero-shot TTS amphion/vits-libritts # VITS trained on LibriTTS # Voice Conversion models amphion/vevo # Zero-shot VC amphion/naturalspeech3_facodec # FACodec amphion/metis # Foundation model for speech # Visit: https://huggingface.co/amphion ``` ### Local Pre-trained Models ```bash # Download pre-trained weights cd pretrained/ # Models are automatically fetched from HuggingFace # Or manually download: wget https://huggingface.co/amphion/vevo/resolve/main/vevo.pt mv vevo.pt vevo/ ``` ## Troubleshooting Commands ```bash # Check installation python -c "import amphion; print(amphion.__version__)" # Verify CUDA python -c "import torch; print(torch.cuda.is_available())" # Check GPU memory python -c "import torch; print(torch.cuda.mem_get_info())" # List available models ls pretrained/ # View training logs tail -f outputs//logs/train.log # Tensorboard visualization tensorboard --logdir outputs//tensorboard ``` ## Performance Tips ### Memory Optimization ```yaml # In configuration: train: gradient_accumulation_steps: 2 # Simulate larger batch enable_gradient_checkpointing: true use_cuda_amp: true # Mixed precision model: use_checkpoint: true # Gradient checkpointing ``` ### Speed Optimization ```bash # Use faster sampler for diffusion models inference: sampler: ddim # Faster than DDPM num_steps: 30 # Fewer steps use_fp16: true # Mixed precision # 
Distributed data loading data: num_workers: 8 # CPU workers for data loading prefetch_factor: 2 ``` ### Quality Optimization ```yaml # Higher quality settings train: max_epochs: 200 # Longer training learning_rate: 5e-4 # Lower learning rate weight_decay: 1e-4 # L2 regularization grad_clip: 1.0 # Tighter gradient clipping model: hidden_size: 512 # Larger model num_layers: 12 # More layers ``` ## Resources - **Official Documentation**: https://amphion.dev - **GitHub Repository**: https://github.com/open-mmlab/Amphion - **Paper (v0.2)**: https://arxiv.org/abs/2501.15442 - **Paper (v0.1)**: https://arxiv.org/abs/2312.09911 - **HuggingFace Models**: https://huggingface.co/amphion - **ModelScope**: https://modelscope.cn/organization/amphion - **Discord Community**: https://discord.com/invite/drhW7ajqAG --- # Singing Voice Conversion (SVC) in Amphion ## Overview Amphion's Singing Voice Conversion (SVC) module enables the conversion of singing voice from one speaker or musical style to another. It supports multiple state-of-the-art architectures and has been the subject of peer-reviewed research published at IEEE SLT 2024. ## Architecture Overview The SVC pipeline typically consists of three main components: 1. **Speaker-Agnostic Feature Extraction**: Extract content representations from source audio 2. **Speaker Embedding Injection**: Inject target speaker information 3. **Waveform Reconstruction**: Generate the output waveform using a vocoder ``` Source Audio → Content Features + Prosody → Acoustic Decoder → Vocoder → Target Audio ↓ Speaker Embedding ``` ## Content Feature Extraction SVC uses speaker-agnostic representations from multiple pretrained models: ### Content Features Extract linguistic content from audio using: - **WeNet**: Automatic Speech Recognition (ASR) based features - Chinese and English support - Robust content representation - https://github.com/wenet-e2e/wenet - **Whisper**: OpenAI's multilingual ASR model - Multi-language support - Robust to noise - Easy integration - https://github.com/openai/whisper - **ContentVec**: Self-supervised content representation - Language-universal features - Pre-trained on multilingual data - https://github.com/auspicious3000/contentvec ### Prosody Features Extract prosodic characteristics: - **F0 (Fundamental Frequency)**: Pitch estimation - **Energy**: Speech intensity and power Configuration example: ```yaml content_feature: type: whisper # or weinet, contentvec use_frame_alignment: true prosody: extract_f0: true extract_energy: true ``` ## Speaker Embeddings Represent target speaker characteristics: ### Speaker Look-Up Table - Pre-computed embeddings for each speaker - Fast inference - Requires speaker ID at test time ### Reference Encoder (Developing) - Extract speaker information from reference audio - Enable zero-shot SVC - No need for pre-computed embeddings ```python # Using speaker embeddings speaker_embedding = model.extract_speaker_embedding(reference_audio) output = model.inference(source_audio, speaker_embedding) ``` ## Acoustic Decoders ### Diffusion-Based Models #### DiffWaveNetSVC **Architecture**: Bidirectional Non-Causal Dilated CNN **Key Features**: - Diffusion probabilistic model framework - Similar to WaveNet and DiffWave - Multiple sampling algorithms support - Deterministic inference possible **Sampling Algorithms**: - **DDPM** (Denoising Diffusion Probabilistic Models): Standard diffusion sampling - **DDIM** (Denoising Diffusion Implicit Models): Faster inference - **PNDM** (Pseudo Numerical Methods): Improved quality 
- **Consistency Model**: Single-step fast inference **Configuration**: ```yaml acoustic_decoder: type: DiffWaveNetSVC num_layers: 20 num_channels: 128 inference: sampler: ddim # or ddpm, pndm, consistency steps: 50 ``` #### DiffComoSVC **Architecture**: Consistency Model based Diffusion **Key Features**: - Significantly faster inference than standard diffusion - Single-step or multi-step sampling - Maintains quality while reducing latency - Based on consistency models **Best For**: Real-time SVC applications ### Transformer-Based Models #### TransformerSVC **Architecture**: Encoder-only Non-autoregressive Transformer **Key Features**: - Pure attention mechanism - Parallel decoding for fast inference - Maintains long-range dependencies - Simple and efficient **Configuration**: ```yaml acoustic_decoder: type: TransformerSVC hidden_size: 384 num_layers: 6 num_heads: 4 feedforward_size: 1536 ``` ### VAE and Flow-Based Models #### VitsSVC **Architecture**: VITS-like Model with Content Features **Key Features**: - Variational autoencoder based - Conditional generation - Normalizing flow for flexible posterior - Similar to so-vits-svc **Paper**: https://arxiv.org/abs/2106.06103 ## Waveform Synthesis (Vocoders) After acoustic decoding, use a vocoder to generate the final waveform: Available vocoders: - **HiFi-GAN**: High-quality GAN-based - **NSF-HiFiGAN**: Neural source-filter (NSF) enhanced HiFi-GAN for improved pitch modeling - **BigVGAN**: Large capacity GAN - **MelGAN**: Lightweight GAN - **WaveGlow**: Flow-based vocoder - **Diffwave**: Diffusion-based vocoder ```yaml vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt ``` ## SVC Workflow ### 1. Data Preparation ```bash # Prepare SVC dataset python bins/data/preprocess_dataset.py \ --config config/svc/prepare_svcc.yaml \ --datasets svcc # For custom datasets: # - Structure: speaker_id/song_name/audio.wav # - Prepare annotations with content and prosody info ``` ### 2. Feature Extraction ```bash # Extract content features python bins/data/extract_acoustic_features.py \ --config config/svc/extract_whisper_feature.yaml \ --data-dir /path/to/svc/data # Extract prosody features python bins/data/extract_prosody.py \ --config config/svc/extract_prosody.yaml ``` ### 3. Training ```bash # Train SVC model python bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml \ --exp-name my_svc_model # Distributed training python -m torch.distributed.launch \ --nproc_per_node=8 \ bins/train.py \ --config config/svc/DiffComoSVC/diffcomosvc.yaml ``` ### 4. 
Inference ```python from amphion.models import build_model from amphion.utils import load_config import torch import soundfile as sf # Load model config = load_config('config/svc/DiffComoSVC/diffcomosvc.yaml') model = build_model(config) checkpoint = torch.load('checkpoints/my_svc_model.pt') model.load_state_dict(checkpoint['model']) model.eval() # Load audio import librosa source_audio, sr = librosa.load('source_song.wav') # Convert voice with torch.no_grad(): output = model.inference( source_audio, target_speaker_id=1, use_fastest=False # For DiffComoSVC ) # Save output sf.write('output.wav', output.cpu().numpy(), sr) ``` ## Supported Datasets Amphion provides recipes for: - **SVCC** (Singing Voice Conversion Challenge) - **VCTK** (multi-speaker English speech corpus) - **M4Singer** (Chinese multi-singer dataset) - **Opencpop** (Chinese singing dataset) - **OpenSinger** (Multi-speaker singing) - **Emilia**: Large-scale multilingual in-the-wild speech data ## Model Architecture Comparison | Model | Type | Speed | Quality | Zero-Shot | Custom Reference | |-------|------|-------|---------|-----------|------------------| | DiffWaveNetSVC | Diffusion | Medium | High | No | No | | DiffComoSVC | Consistency | Fast | High | No | No | | TransformerSVC | Transformer | Fast | Medium | No | No | | VitsSVC | VAE/Flow | Fast | High | No | No | ## Configuration Structure ```yaml # Content and prosody features acoustic_features: content_feature: type: whisper # whisper, wenet, contentvec prosody: extract_f0: true extract_energy: true # Acoustic decoder acoustic_decoder: type: DiffComoSVC # Model selection hidden_size: 256 num_layers: 20 # Speaker embedding speaker_embedding: type: lookup # or reference_encoder num_speakers: 100 embedding_dim: 256 # Vocoder vocoder: type: hifigan checkpoint: pretrained/vocoders/hifigan.pt # Training train: batch_size: 16 num_epochs: 100 learning_rate: 1e-3 optimizer: adamw # Inference inference: sampler: ddim # For diffusion models sampler_steps: 50 ``` ## Advanced Features ### Multiple Content Features Amphion investigates multiple content representations in the official paper: ```yaml # Use multiple content features acoustic_features: content_features: - whisper - contentvec - wenet fusion_method: concatenate # or weighted ``` ### Zero-Shot SVC (Reference Encoder) Extract speaker info from reference audio at inference time: ```python reference_audio, sr = librosa.load('reference_song.wav') speaker_embedding = model.extract_speaker_embedding(reference_audio) output = model.inference( source_audio, speaker_embedding=speaker_embedding ) ``` ## Evaluation Metrics Evaluate SVC models using: - **MCD (Mel-Cepstral Distortion)**: Spectral similarity - **FAD (Frechet Audio Distance)**: Audio distribution distance - **PESQ**: Speech quality assessment - **Similarity Score**: Via speaker verification models (RawNet3, WeSpeaker) - **Intelligibility**: Via ASR (Whisper) ## Research Insights From the Amphion SLT 2024 paper on multiple content features: - **WeNet**: Best for content representation in singing - **Whisper**: Good multilingual support - **ContentVec**: Competitive performance Combining multiple features can improve overall quality. ## Troubleshooting ### Poor Output Quality 1. Check content feature extraction: - Verify alignment between source and extracted features - Try different content models 2. Verify speaker embeddings: - Ensure adequate speaker data - Check speaker embedding dimensions 3. 
Adjust vocoder: - Use higher-quality vocoder - Fine-tune vocoder on target domain ### Artifacts and Noise 1. Increase training duration 2. Use gradient accumulation for larger effective batch size 3. Try different sampler (DDIM → PNDM) 4. Increase sampler steps ### Slow Inference - Use DiffComoSVC for fast diffusion - Use fewer sampler steps - Reduce audio length - Use GPU acceleration ## Resources - **GitHub Recipe**: https://github.com/open-mmlab/Amphion/egs/svc/ - **Paper**: https://arxiv.org/abs/2310.11160 - **Demo**: https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html - **Community**: https://discord.com/invite/drhW7ajqAG --- # Text-to-Audio (TTA) in Amphion ## Overview Amphion's Text-to-Audio (TTA) module enables generation of diverse audio content from natural language descriptions. It uses a latent diffusion model architecture similar to AudioLDM, Make-an-Audio, and AUDIT. ## Architecture Overview The TTA system uses a two-stage approach: ### Stage 1: Latent Space Learning Train an autoencoder to compress audio into a latent space: ``` Raw Audio → Encoder → Latent Codes → Decoder → Reconstructed Audio (Compressed) ``` **Component**: `AutoencoderKL` in Amphion - Variational autoencoder with KL divergence - Compresses audio by ~4x - Learns meaningful latent representations ### Stage 2: Conditional Diffusion in Latent Space Train a diffusion model to generate latent codes conditioned on text: ``` Text Description → Text Encoder → CLIP/CLAP embeddings ↓ Diffusion Model ↓ Latent Codes → Decoder → Generated Audio ``` **Component**: `AudioLDM` in Amphion - Conditional latent diffusion model - Text-conditioned generation - Multiple sampling strategies ## TTA Capabilities ### Diverse Audio Generation Generate different types of audio from descriptions: - **Sound Effects**: Thunder, water splash, door knock - **Music**: Ambient, electronic, acoustic styles - **Environmental Audio**: Forest, traffic, rain sounds - **Speech**: Various prosody and emotion - **Hybrid Content**: Mixed audio scenarios ### Control and Conditioning Fine-grained control over generation: - **Text Prompts**: Descriptive text for generation - **Negative Prompts**: Specify unwanted characteristics - **Duration Control**: Control output length - **Style Control**: Specify audio style/genre - **Intensity Control**: Adjust generation strength ## TTA Workflow ### 1. Model Architecture Setup Configure the two-stage model: ```yaml # Stage 1: VAE (AutoencoderKL) autoencoder_kl: type: AutoencoderKL in_channels: 1 out_channels: 1 latent_channels: 8 hidden_channels: 128 # Stage 2: Diffusion Model diffusion_model: type: AudioLDM latent_channels: 8 text_encoder: t5 # or clap, clip num_steps: 1000 # Diffusion steps ``` ### 2. Data Preparation Prepare audio-text pairs: ```bash # Directory structure dataset/ ├── audio/ │ ├── sound_001.wav │ ├── sound_002.wav │ └── ... └── text_descriptions/ ├── sound_001.txt ├── sound_002.txt └── ... # Each text file contains description of corresponding audio ``` Preprocess audio: ```bash python bins/data/preprocess_tta.py \ --audio-dir path/to/audio \ --text-dir path/to/descriptions \ --output-dir processed_data ``` ### 3. Stage 1: Train AutoencoderKL First, train the VAE to learn latent representation: ```bash python bins/train.py \ --config config/tta/autoencoderkl.yaml \ --exp-name tta_vae ``` Configuration: ```yaml model: type: AutoencoderKL # ... 
architecture parameters data: dataset: audio_descriptions batch_size: 32 num_workers: 4 train: max_epochs: 50 learning_rate: 1e-3 loss_type: mse # Reconstruction loss ``` ### 4. Stage 2: Train AudioLDM After VAE training, train the diffusion model: ```bash python bins/train.py \ --config config/tta/audioldm.yaml \ --exp-name tta_diffusion ``` Configuration: ```yaml model: type: AudioLDM # ... architecture parameters pretrained_vae: path/to/vae_checkpoint.pt # From stage 1 # Text encoder for conditioning text_encoder: type: t5 # or clap, clip model_name: t5-base freeze_encoder: false data: batch_size: 16 train: max_epochs: 100 learning_rate: 5e-5 # Diffusion training specifics ``` ### 5. Inference Generate audio from text descriptions: ```python from amphion.models import build_model from amphion.utils import load_config import torch import torchaudio # Load models config = load_config('config/tta/audioldm.yaml') vae = build_model(config.vae_config) diffusion_model = build_model(config.diffusion_config) # Load checkpoints vae.load_state_dict(torch.load('checkpoints/vae.pt')) diffusion_model.load_state_dict(torch.load('checkpoints/diffusion.pt')) vae.eval() diffusion_model.eval() # Generate audio text_prompt = "A dog barking in the distance with ambient traffic noise" with torch.no_grad(): # Text encoding text_embeddings = diffusion_model.encode_text(text_prompt) # Diffusion sampling in latent space latent_codes = diffusion_model.sample( embeddings=text_embeddings, num_steps=50, guidance_scale=7.5 ) # Decode to audio audio = vae.decode(latent_codes) # Save generated audio torchaudio.save('output.wav', audio.squeeze(0), 16000) ``` ### 6. Advanced Inference Options #### Negative Prompts Specify what NOT to generate: ```python output = diffusion_model.sample( prompt="Dog barking", negative_prompt="cat, bird, quiet", guidance_scale=7.5 ) ``` #### Classifier-Free Guidance Control generation strength: ```python output = diffusion_model.sample( prompt="Thunder storm with heavy rain", guidance_scale=10.0 # Higher = stronger adherence to prompt ) ``` #### Sampling Methods Different diffusion samplers: ```python # DDPM (standard) output = diffusion_model.sample(sampler='ddpm', num_steps=1000) # DDIM (faster) output = diffusion_model.sample(sampler='ddim', num_steps=50) # PNDM (quality + speed balance) output = diffusion_model.sample(sampler='pndm', num_steps=50) # Euler output = diffusion_model.sample(sampler='euler', num_steps=30) ``` #### Seed Control Reproducible generation: ```python torch.manual_seed(42) output1 = diffusion_model.sample(prompt="dog barking") torch.manual_seed(42) output2 = diffusion_model.sample(prompt="dog barking") # output1 and output2 are identical ``` ## Text Encoders TTA can use different text encoders: ### T5 (Text-to-Text Transfer Transformer) ```python text_encoder = T5Tokenizer.from_pretrained('t5-base') embeddings = text_encoder('A dog barking') # Shape: [1, seq_length, 768] ``` ### CLAP (Contrastive Language-Audio Pre-training) ```python # CLAP embeddings are audio-aligned text_encoder = CLAPTextEncoder() embeddings = text_encoder('A dog barking') # Shape: [1, 512] - audio-aligned representations ``` ### CLIP (Vision-Language Model) Alternative multi-modal conditioning ## Supported Datasets Amphion supports TTA training on: - **AudioCaps**: 49k audio clips with captions - **Clotho**: 5k audio samples with multiple descriptions - **Emilia**: Large-scale speech descriptions - **Custom Datasets**: With proper annotation format Dataset structure: ```yaml dataset: name: 
audiocaps root_dir: /path/to/audiocaps split: train # or val, test # Preprocessing preprocessing: sample_rate: 16000 num_mels: 64 n_fft: 400 hop_length: 160 ``` ## Configuration Structure Complete TTA configuration: ```yaml # Stage 1: VAE Configuration vae_config: model: type: AutoencoderKL in_channels: 1 latent_channels: 8 hidden_channels: 128 num_res_blocks: 2 train: learning_rate: 1e-3 batch_size: 32 max_epochs: 50 # Stage 2: Diffusion Configuration diffusion_config: model: type: AudioLDM latent_channels: 8 hidden_channels: 512 num_layers: 24 attention_heads: 8 text_encoder: type: t5 # or clap freeze: false diffusion: beta_schedule: linear num_steps: 1000 train: learning_rate: 5e-5 batch_size: 16 max_epochs: 100 warmup_steps: 5000 # Data configuration data: dataset: audiocaps sample_rate: 16000 num_mels: 64 # Inference configuration inference: sampler: ddim num_steps: 50 guidance_scale: 7.5 ``` ## Performance Metrics Evaluate TTA quality using: - **FAD (Frechet Audio Distance)**: Audio distribution similarity - **KL Divergence**: Distribution divergence metric - **PESQ**: Perceived speech quality (for speech-like audio) - **Inception Score**: Diversity and quality metric - **Text Alignment Score**: How well generated audio matches text ## Troubleshooting ### Poor Audio Quality 1. **Increase training**: More epochs, larger dataset 2. **Improve text descriptions**: More detailed, specific prompts 3. **Adjust guidance scale**: Higher values (7.5-15.0) 4. **Try different sampler**: PNDM often better than DDIM ### Mode Collapse (Repetitive Outputs) 1. Increase diversity regularization 2. Use higher temperature in sampling 3. Augment training data with more diverse examples ### Slow Inference 1. Use fewer diffusion steps (DDIM with 30-50 steps) 2. Use GPU acceleration 3. Reduce audio quality (lower sample rate) ### Training Instability 1. Lower learning rate 2. Smaller batch size 3. Gradient clipping 4. Warm-up scheduler ## Advanced Topics ### Fine-tuning Pre-trained Models ```bash python bins/train.py \ --config config/tta/audioldm_finetune.yaml \ --pretrained-diffusion pretrained/audioldm.pt \ --custom-data path/to/custom/data ``` ### Conditioning on Audio Features Additional conditioning beyond text: ```python # Condition on audio duration output = diffusion_model.sample( prompt="dog barking", duration=3.0 # 3 seconds ) # Condition on loudness output = diffusion_model.sample( prompt="dog barking", loudness_db=-10 # Target loudness ) ``` ### Audio Style Transfer Transfer audio style while maintaining content: ```python # Reference audio for style style_audio, sr = librosa.load('reference.wav') style_embeddings = diffusion_model.extract_style(style_audio) # Generate with style output = diffusion_model.sample( prompt="dog barking", style_embeddings=style_embeddings ) ``` ## Research Background TTA in Amphion is based on: - **AudioLDM**: Latent diffusion for audio generation (2301.12503) - **Make-an-Audio**: Large-scale audio generation (2301.12661) - **AUDIT**: Audio understanding through diffusion (2304.00830) These models showed that latent diffusion is highly effective for audio synthesis. 
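Because `guidance_scale` appears throughout the sampling examples in this section, the toy sketch below shows the classifier-free guidance step it controls. The function name and tensor shapes are illustrative only and not part of the Amphion API.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions.

    A guidance_scale of 1.0 reduces to plain conditional sampling; larger
    values (e.g. 7.5) push each denoising step harder toward the text prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage: random tensors stand in for the diffusion model's two noise
# predictions over a batch of latent codes (batch, latent_channels, time, freq).
eps_c = torch.randn(1, 8, 256, 16)
eps_u = torch.randn(1, 8, 256, 16)
eps = classifier_free_guidance(eps_c, eps_u, guidance_scale=7.5)
```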
## Resources - **GitHub Recipe**: https://github.com/open-mmlab/Amphion/egs/tta/ - **Beginner Recipe**: https://github.com/open-mmlab/Amphion/egs/tta/RECIPE.md - **Amphion Paper**: https://arxiv.org/abs/2312.09911 - **AudioLDM Paper**: https://arxiv.org/abs/2301.12503 - **Community**: https://discord.com/invite/drhW7ajqAG --- # Text-to-Speech (TTS) in Amphion ## Overview Amphion's Text-to-Speech (TTS) module provides state-of-the-art text-to-speech capabilities with multiple supported architectures. The TTS system converts natural language text into high-quality synthesized speech with controllable prosody and speaker characteristics. ## Supported TTS Models ### 1. FastSpeech2 **Architecture**: Non-autoregressive Transformer-based **Key Features**: - Feed-forward Transformer blocks - Faster inference than autoregressive models - Supports multiple speakers - Duration prediction for prosody control - Pitch and energy prediction **Best For**: Real-time TTS applications, multi-speaker synthesis **Configuration Location**: `config/tts/FastSpeech2/` ### 2. VITS (Variational Inference with adversarial Learning for end-to-end Text-to-Speech) **Architecture**: End-to-end with Conditional VAE and Adversarial Learning **Key Features**: - Conditional variational autoencoder - Adversarial training with discriminator - Integrated vocoder for waveform generation - Excellent voice quality - Supports multiple speakers **Best For**: High-quality speech synthesis, end-to-end training **Paper**: https://arxiv.org/abs/2106.06103 **Configuration Location**: `config/tts/VITS/` ### 3. VALL-E (Voice Across Languages Language Encoding) **Architecture**: Neural Codec Language Model with Discrete Codes **Key Features**: - Zero-shot TTS capabilities - Uses discrete audio tokens - Few-shot voice adaptation - Multilingual support - Large-scale pre-training **Best For**: Zero-shot voice cloning, multilingual synthesis **Paper**: https://arxiv.org/abs/2301.02111 **Configuration Location**: `config/tts/VALLE/` ### 4. NaturalSpeech2 **Architecture**: Latent Diffusion Model **Key Features**: - Diffusion-based generation - Natural prosody modeling - Improved speech quality - Controllable generation - Superior naturalness **Best For**: Natural-sounding speech, research and development **Paper**: https://arxiv.org/abs/2304.09116 **Configuration Location**: `config/tts/NaturalSpeech2/` ### 5. Jets (Joint End-to-end Text-to-Speech) **Architecture**: Joint Training of FastSpeech2 and HiFi-GAN **Key Features**: - Joint optimization of acoustic model and vocoder - Alignment module for duration prediction - End-to-end training - Improved consistency between stages **Best For**: Unified acoustic and vocoder training **Configuration Location**: `config/tts/Jets/` ### 6. MaskGCT (Masked Generative Codec Transformer) **Architecture**: Fully Non-autoregressive Architecture **Key Features**: - Eliminates explicit text-speech alignment requirements - Fully non-autoregressive generation - State-of-the-art performance - Zero-shot capabilities - Fast inference **Best For**: Fast, alignment-free TTS, zero-shot synthesis **Paper**: https://arxiv.org/abs/2409.00750 **Availability**: Pre-trained models on HuggingFace and ModelScope ### 7. 
Vevo-TTS **Architecture**: Autoregressive + Flow-Matching Transformer **Key Features**: - Zero-shot TTS with controllable timbre and style - Flexible voice control - Speech and singing voice synthesis - Multiple voice aspects controllable - Style transfer capabilities **Best For**: Controllable zero-shot TTS, voice cloning with style control **Paper**: https://openreview.net/pdf?id=anQDiQZhDP **Configuration Location**: `models/vc/vevo/` ## Common TTS Workflow ### 1. Data Preparation ```bash # Prepare your dataset cd Amphion python bins/data/preprocess_dataset.py \ --config config/tts/VITS/prepare_libritts.yaml \ --datasets libritts # For custom datasets, modify the configuration file to point to your data ``` ### 2. Training ```bash # Train a TTS model python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_tts_model # Resume from checkpoint python bins/train.py \ --config config/tts/VITS/vits.yaml \ --exp-name my_tts_model \ --resume ``` ### 3. Inference ```python from amphion.models import build_model from amphion.utils import load_config import torch # Load model config = load_config('config/tts/VITS/vits.yaml') model = build_model(config) checkpoint = torch.load('path/to/checkpoint.pt') model.load_state_dict(checkpoint['model']) model.eval() # Generate speech with torch.no_grad(): text = "Hello, this is a test." output = model.inference(text) ``` ### 4. Evaluation ```bash # Evaluate TTS model python bins/metrics/eval.py \ --config config/tts/VITS/vits.yaml \ --checkpoint path/to/checkpoint.pt ``` ## Configuration Structure TTS configurations follow this general structure: ```yaml # Model architecture model: type: VITS # or FastSpeech2, VALL-E, etc. hidden_size: 384 encoder_hidden_size: 384 # ... model-specific parameters # Data configuration data: dataset: libritts data_dir: /path/to/data batch_size: 16 num_workers: 4 # Training configuration train: max_epochs: 100 learning_rate: 1e-3 optimizer: adam grad_clip: 5.0 # Inference configuration inference: speaker_id: 0 # For multi-speaker models duration_scale: 1.0 pitch_scale: 1.0 ``` ## Multi-Speaker TTS For models supporting multiple speakers: ```python # Specify speaker ID during inference output = model.inference( text="Hello world", speaker_id=1 ) # Or use speaker embedding speaker_embedding = model.get_speaker_embedding(speaker_id=1) output = model.inference(text="Hello world", speaker_embedding=speaker_embedding) ``` ## Supported Datasets Amphion supports preprocessing for these TTS datasets: - **LibriTTS**: Large-scale multi-speaker English speech - **LJSpeech**: Single-speaker English speech - **VCTK**: Multi-speaker English speech - **OpenSinger**: Chinese singing voice - **M4Singer**: Chinese multi-speaker singing - **Emilia**: Multilingual in-the-wild speech (101k+ hours) ## Voice Characteristics Control Different TTS models offer various levels of control: ### Duration Control (FastSpeech2, VITS) ```python # Speed up or slow down speech output = model.inference( text="Hello world", duration_scale=0.8 # 20% faster ) ``` ### Pitch Control ```python # Modify fundamental frequency output = model.inference( text="Hello world", pitch_scale=1.2 # Higher pitch ) ``` ### Energy Control ```python # Adjust speaking energy/intensity output = model.inference( text="Hello world", energy_scale=0.9 ) ``` ## Vocoder Integration Most TTS models require a vocoder to convert acoustic features to waveform: ```bash # Train with HiFi-GAN vocoder python bins/train.py \ --config config/tts/VITS/vits_hifigan.yaml ``` Available 
## Pre-trained Models

Access pre-trained models from:

- **HuggingFace**: https://huggingface.co/amphion (MaskGCT, Vevo, and others)
- **ModelScope**: https://modelscope.cn/organization/amphion (MaskGCT, Metis, and others)
- **Local**: Provided in the `pretrained/` directory

### Using Pre-trained Models

```python
from amphion.models import build_model
from amphion.utils import load_config

# Load pre-trained VALL-E (build the model from its config first)
config = load_config('config/tts/VALLE/valle.yaml')  # adjust to your config path
model = build_model(config)
model.load_pretrained('amphion/vall-e')

# Inference
output = model.inference("Your text here")
```

## TTS Demo Samples

Listen to TTS samples from Amphion models: https://openhlt.github.io/Amphion_TTS_Demo/

## Performance Metrics

TTS quality is evaluated using:

- **MOS (Mean Opinion Score)**: Subjective speech quality (scale 1-5)
- **PESQ (Perceptual Evaluation of Speech Quality)**: Objective speech quality
- **FAD (Frechet Audio Distance)**: Distribution distance metric
- **WER (Word Error Rate)**: Via ASR (Whisper)
- **Speaker Similarity**: Via speaker verification models

## Troubleshooting

### Out-of-Memory Errors

```yaml
# Reduce batch size
train:
  batch_size: 8  # Decrease from default

  # Enable gradient accumulation
  gradient_accumulation_steps: 2

# Enable gradient checkpointing
model:
  use_checkpoint: true
```

### Poor Voice Quality

- Ensure high-quality training data
- Increase training duration
- Adjust the learning rate schedule
- Try a different vocoder

### Alignment Issues (for models needing alignment)

- Use Montreal Forced Aligner (MFA) for better alignment
- Adjust the forced alignment configuration
- Check data quality

## Advanced Topics

### Fine-tuning Pre-trained Models

```bash
python bins/train.py \
    --config config/tts/VITS/vits.yaml \
    --exp-name fine_tune \
    --pretrained-model-name amphion/vits-libritts \
    --resume
```

### Knowledge Distillation

Train a student model from a teacher:

```yaml
distillation:
  enabled: true
  teacher_model: vits
  temperature: 5.0
  alpha: 0.5
```

### Data Augmentation

```yaml
data_augmentation:
  speed_perturb: [0.95, 1.05]
  pitch_shift: [-2, 2]
  energy_scale: [0.9, 1.1]
```
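The augmentation ranges above correspond to standard waveform-level transforms. The standalone sketch below only illustrates what each setting means, using `librosa`; how Amphion applies them inside its own data pipeline is not shown here:

```python
import random

import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=None)

speed = random.uniform(0.95, 1.05)   # speed_perturb: [0.95, 1.05]
n_steps = random.uniform(-2, 2)      # pitch_shift: [-2, 2] semitones
gain = random.uniform(0.9, 1.1)      # energy_scale: [0.9, 1.1]

# Apply speed perturbation, pitch shift, and energy scaling in sequence
y_aug = librosa.effects.time_stretch(y, rate=speed)
y_aug = librosa.effects.pitch_shift(y_aug, sr=sr, n_steps=n_steps)
y_aug = np.clip(y_aug * gain, -1.0, 1.0)

sf.write("sample_augmented.wav", y_aug, sr)
```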
## Resources

- **Official Docs**: https://amphion.dev
- **GitHub Repo**: https://github.com/open-mmlab/Amphion
- **Paper**: https://arxiv.org/abs/2312.09911
- **Community**: https://discord.com/invite/drhW7ajqAG

---

# Voice Conversion (VC) in Amphion

## Overview

Amphion's Voice Conversion module enables zero-shot and few-shot voice conversion with fine-grained control over speaker characteristics. It supports multiple advanced models designed for quality, naturalness, and flexibility.

## Voice Conversion Capabilities

Voice Conversion in Amphion can handle:

- **Voice Conversion (VC)**: Convert speaker identity while preserving content
- **Accent Conversion (AC)**: Change accent while maintaining speaker characteristics
- **Timbre Conversion**: Adjust voice timbre and color
- **Style Conversion**: Modify speaking/singing style

## Supported Voice Conversion Models

### 1. Vevo (VersatileVoice)

**Architecture**: Zero-shot voice imitation framework with controllable timbre and style

**Released**: December 2024

**Key Features**:
- **Zero-shot capabilities**: Convert any voice without fine-tuning
- **Controllable generation**: Independent control of timbre and style
- **Dual-branch design**:
  - **Vevo-Timbre**: Style-preserved voice conversion
  - **Vevo-Voice**: Style-converted voice conversion
- **Multi-task capability**:
  - Voice Conversion (VC)
  - Text-to-Speech (TTS)
  - Accent Conversion (AC)
  - Speech Enhancement

**Model Details**:
- Autoregressive Transformer + Flow-Matching Transformer
- Trained on the Emilia dataset (101k+ hours)
- State-of-the-art zero-shot VC performance
- Pre-trained models available on HuggingFace

**Paper**: https://openreview.net/pdf?id=anQDiQZhDP

**Configuration Location**: `models/vc/vevo/`

#### Vevo Usage Example

```python
from amphion.models import build_model
from amphion.utils import load_config

# Load pre-trained Vevo model
config = load_config('config/vc/vevo/vevo.yaml')
model = build_model(config)
model.load_pretrained('amphion/vevo')

# Voice conversion with style preservation
output = model.inference(
    source_audio='input.wav',
    target_speaker_audio='reference.wav',
    mode='timbre'  # Preserve style
)

# Voice conversion with style transfer
output = model.inference(
    source_audio='input.wav',
    target_speaker_audio='reference.wav',
    mode='voice'  # Convert both timbre and style
)
```

#### Vevo1.5 (April 2025)

Enhanced version extending Vevo with:

- Unified speech and singing voice generation
- More robust generation
- Extended zero-shot capabilities
- Better accent conversion

**Blog**: https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a

### 2. FACodec (Factorized Codec)

**Architecture**: Neural audio codec with decomposition

**Key Features**:
- Decomposes speech into subspaces:
  - **Content**: Linguistic information
  - **Prosody**: Pitch and duration patterns
  - **Timbre**: Speaker-specific characteristics
- Zero-shot voice conversion
- Flexible audio manipulation
- Continuous representation

**Paper**: https://arxiv.org/abs/2403.03100

**Available Models**:
- NaturalSpeech3 FACodec
- Pre-trained checkpoint on HuggingFace

**Usage** (a timbre-swap sketch building on this API follows the model list below):

```python
from amphion.models import build_model

# Load FACodec (build the model from its config first, as in the other examples)
model = build_model(config)
model.load_pretrained('amphion/naturalspeech3_facodec')

# Decompose speech
content, prosody, timbre = model.decompose(audio)

# Reconstruct with a different timbre (e.g., one decomposed from a reference utterance)
output = model.reconstruct(content, prosody, target_timbre)
```

### 3. Noro (Noise-Robust Voice Conversion)

**Architecture**: Zero-shot voice conversion for noisy conditions

**Released**: 2024

**Key Features**:
- **Noise robustness**: Handles noisy reference speech
- **Dual-branch reference encoding**:
  - Speech branch: Captures voice characteristics
  - Noise branch: Suppresses noise information
- **Contrastive learning**: Noise-agnostic speaker loss
- Zero-shot capability
- Robust to various noise types

**Paper**: https://arxiv.org/abs/2411.19770

**Best For**: Real-world voice conversion with background noise

**Configuration Location**: `egs/vc/Noro/`
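As a usage example of the FACodec decomposition above, the following sketch keeps the source utterance's content and prosody but swaps in the timbre decomposed from a reference utterance. It reuses the `decompose`/`reconstruct` calls from the FACodec usage example; the config path is illustrative:

```python
import librosa
import torch

from amphion.models import build_model
from amphion.utils import load_config

config = load_config("config/codec/facodec.yaml")  # illustrative path
model = build_model(config)
model.load_pretrained("amphion/naturalspeech3_facodec")
model.eval()

source, sr = librosa.load("source.wav", sr=16000)
reference, _ = librosa.load("reference.wav", sr=16000)

with torch.no_grad():
    content, prosody, _ = model.decompose(source)      # keep source content and prosody
    _, _, target_timbre = model.decompose(reference)   # take the reference timbre
    converted = model.reconstruct(content, prosody, target_timbre)
```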
## Metis Foundation Model (February 2025)

**Purpose**: Unified speech generation foundation model

**Capabilities**:
- Zero-shot text-to-speech
- Voice conversion
- Target speaker extraction
- Speech enhancement
- Lip-to-speech

**Pre-trained Models**: Available on HuggingFace

**Paper**: https://arxiv.org/pdf/2502.03128

## VC Workflow

### 1. Voice Conversion Inference

Using Vevo for zero-shot VC:

```bash
# Command line inference
python bins/inference.py \
    --config config/vc/vevo/vevo.yaml \
    --checkpoint pretrained/vevo/vevo.pt \
    --input-audio source.wav \
    --reference-audio target_speaker.wav \
    --output-path output.wav
```

### 2. Python API

```python
import librosa
import soundfile as sf
import torch

from amphion.models import build_model
from amphion.utils import load_config

# Load model configuration
config = load_config('config/vc/vevo/vevo.yaml')
model = build_model(config)

# Load pre-trained weights
model.load_pretrained('amphion/vevo')
model.eval()

# Load audio files
source_audio, sr = librosa.load('source.wav', sr=16000)
reference_audio, _ = librosa.load('reference.wav', sr=16000)

# Perform voice conversion
with torch.no_grad():
    output = model.inference(
        source_audio=source_audio,
        target_speaker_audio=reference_audio,
        mode='timbre',  # or 'voice'
        pitch_scale=1.0,
        energy_scale=1.0
    )

# Save output
sf.write('output.wav', output.cpu().numpy(), sr)
```
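A common variant of the workflow above is converting a whole folder of utterances to the same reference speaker. A minimal sketch that reuses the `model` and inference arguments from the Python API example; directory names are illustrative:

```python
from pathlib import Path

import librosa
import soundfile as sf
import torch

reference_audio, sr = librosa.load("reference.wav", sr=16000)

out_dir = Path("converted")
out_dir.mkdir(exist_ok=True)

for wav_path in sorted(Path("source_wavs").glob("*.wav")):
    source_audio, _ = librosa.load(wav_path, sr=16000)
    with torch.no_grad():
        output = model.inference(
            source_audio=source_audio,
            target_speaker_audio=reference_audio,
            mode="timbre",
        )
    # Keep the original file name in the output directory
    sf.write(out_dir / wav_path.name, output.cpu().numpy(), sr)
```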
## VC Applications

### 1. Voice Cloning

Clone a speaker's voice for new content:

```python
# Reference audio from target speaker
reference_audio, sr = librosa.load('speaker_voice.wav')

# Source speech to convert (generated via TTS first, or existing speech)
source_speech, _ = librosa.load('source_speech.wav')

# Convert
output = model.voice_conversion(
    source_speech,
    reference_audio,
    mode='voice'
)
```

### 2. Accent Conversion

Modify accent while preserving speaker identity:

```python
# Reference audio with target accent
reference_audio, sr = librosa.load('target_accent.wav')

# Apply accent conversion
output = model.accent_conversion(
    source_speech,
    reference_audio
)
```

### 3. Timbre Adjustment

Modify voice characteristics:

```python
# Reference audio with desired timbre
reference_audio, sr = librosa.load('reference.wav')

# Apply timbre modification
output = model.timbre_conversion(
    source_speech,
    reference_audio,
    preservation_strength=0.7  # Balance between preservation and conversion
)
```

### 4. Real-World Applications

Use Noro for robust VC with noisy reference audio:

```python
# Handle noisy reference audio
noisy_reference_audio, sr = librosa.load('noisy_reference.wav')

output = model.robust_voice_conversion(
    source_speech,
    noisy_reference_audio,
    noise_robustness=True
)
```

## Configuration Structure

```yaml
# Model architecture
model:
  type: Vevo  # or FACodec, Noro, Metis
  hidden_size: 256
  num_layers: 12

# Encoder configuration
encoder:
  type: transformer
  num_heads: 8

# Decoder configuration
decoder:
  type: transformer
  num_heads: 8

# Vocoder
vocoder:
  type: hifigan
  checkpoint: pretrained/vocoders/hifigan.pt

# Inference settings
inference:
  mode: timbre  # or voice
  pitch_scale: 1.0
  energy_scale: 1.0
  duration_scale: 1.0
```

## Audio Quality Control

Control output characteristics:

```python
output = model.inference(
    source_audio=source,
    target_speaker_audio=reference,

    # Voice quality parameters
    pitch_scale=1.0,     # Adjust pitch (0.5-2.0)
    energy_scale=1.0,    # Adjust loudness (0.5-2.0)
    duration_scale=1.0,  # Adjust speaking rate (0.5-2.0)

    # Conversion intensity
    conversion_strength=1.0,  # 0.0 = no change, 1.0 = full conversion
)
```

## Pre-trained Models

### Vevo
- HuggingFace: https://huggingface.co/amphion/Vevo
- ModelScope: https://modelscope.cn/models/amphion/Vevo
- All pre-trained on the Emilia dataset

### FACodec
- HuggingFace: https://huggingface.co/amphion/naturalspeech3_facodec
- Pre-trained model checkpoint included

### Noro
- Available in the repository
- Trained on multiple voice conversion datasets

### Metis
- HuggingFace: https://huggingface.co/amphion/metis
- Foundation model for unified speech generation

## Supported Datasets for Training

- **VCTK**: Multi-speaker English speech
- **TIMIT**: Phonetically balanced speech
- **VoxCeleb**: Speaker recognition dataset
- **Emilia**: Large-scale multilingual in-the-wild data
- **Custom datasets**: With proper preprocessing

## Performance Metrics

Evaluate voice conversion using:

- **MCD (Mel-Cepstral Distortion)**: Spectral similarity
- **FAD (Frechet Audio Distance)**: Perceptual quality
- **Speaker Similarity**: Via speaker verification models
  - RawNet3
  - WeSpeaker
  - WavLM
- **Content Preservation**: Via ASR (Whisper)
- **PESQ**: Voice quality metric

## Comparison with Baselines

| Model   | Zero-Shot | Robustness | Speed  | Quality   |
|---------|-----------|------------|--------|-----------|
| Vevo    | Yes       | Medium     | Fast   | High      |
| Vevo1.5 | Yes       | High       | Fast   | Very High |
| FACodec | Yes       | Medium     | Fast   | High      |
| Noro    | Yes       | Very High  | Medium | High      |
| Metis   | Yes       | High       | Medium | Very High |

## Advanced Features

### Multi-Reference Voice Cloning

Use multiple reference speakers:

```python
# Multiple references with mixing weights
references = [
    ('speaker1.wav', 0.3),
    ('speaker2.wav', 0.5),
    ('speaker3.wav', 0.2),
]

output = model.multi_reference_conversion(
    source_audio,
    references=references
)
```

### Fine-tuning for Custom Voices

```bash
python bins/train.py \
    --config config/vc/vevo/vevo_finetune.yaml \
    --pretrained-model-name amphion/vevo \
    --custom-speaker-data path/to/speaker/data
```

### Streaming/Online Voice Conversion

For real-time applications:

```python
model.set_inference_mode('streaming')

output = model.streaming_inference(
    audio_stream,      # Streaming audio input
    reference_audio,
    chunk_length=8000  # Process in chunks
)
```
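A minimal driver for the streaming mode shown above, simulating a live stream by slicing a file into fixed-size chunks; the `streaming_inference` call and its arguments follow the example above, and a real application would feed microphone buffers instead:

```python
import librosa
import numpy as np

reference_audio, sr = librosa.load("reference.wav", sr=16000)
audio, _ = librosa.load("long_input.wav", sr=16000)

model.set_inference_mode("streaming")

chunk_length = 8000  # 0.5 s at 16 kHz
outputs = []
for start in range(0, len(audio), chunk_length):
    # Feed one chunk at a time, as a live source would
    chunk = audio[start:start + chunk_length]
    out = model.streaming_inference(
        chunk,
        reference_audio,
        chunk_length=chunk_length,
    )
    outputs.append(np.asarray(out))

converted = np.concatenate(outputs)
```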
## Troubleshooting

### Voice Quality Issues

1. **Artifacts**: Use higher-quality reference audio
2. **Unnatural pitch**: Adjust the `pitch_scale` parameter
3. **Poor timbre**: Try different reference speakers
4. **Noisy output**: Increase reference audio quality or use Noro

### Inference Speed

- Use GPU acceleration
- Reduce audio length
- Use VQ-based models for faster inference

### Memory Issues

```python
# Enable gradient checkpointing if training
model.enable_gradient_checkpointing()

# Reduce batch size for inference
model.set_batch_size(1)
```

## Resources

- **GitHub VC Module**: https://github.com/open-mmlab/Amphion/tree/main/models/vc
- **Vevo Paper**: https://openreview.net/pdf?id=anQDiQZhDP
- **FACodec Paper**: https://arxiv.org/abs/2403.03100
- **Noro Paper**: https://arxiv.org/abs/2411.19770
- **Metis Paper**: https://arxiv.org/pdf/2502.03128
- **Demo**: https://versavoice.github.io/
- **Community**: https://discord.com/invite/drhW7ajqAG