# Pyannote

> - Can I apply pretrained pipelines on audio already loaded in memory?

---

# Frequently Asked Questions
- [Can I apply pretrained pipelines on audio already loaded in memory?](#can-i-apply-pretrained-pipelines-on-audio-already-loaded-in-memory)
- [Can I use gated models (and pipelines) offline?](#can-i-use-gated-models-(and-pipelines)-offline)
- [Does pyannote support streaming speaker diarization?](#does-pyannote-support-streaming-speaker-diarization)
- [How can I improve performance?](#how-can-i-improve-performance)
- [How does one spell and pronounce pyannote.audio?](#how-does-one-spell-and-pronounce-pyannoteaudio)

<a name="can-i-apply-pretrained-pipelines-on-audio-already-loaded-in-memory"></a>
## Can I apply pretrained pipelines on audio already loaded in memory?

Yes: read [this tutorial](tutorials/applying_a_pipeline.ipynb) until the end.

<a name="can-i-use-gated-models-(and-pipelines)-offline"></a>
## Can I use gated models (and pipelines) offline?

**Short answer**: yes, see [this tutorial](tutorials/applying_a_model.ipynb) for models and [that one](tutorials/applying_a_pipeline.ipynb) for pipelines.

**Long answer**: gating models and pipelines allows [me](https://herve.niderb.fr) to know a bit more about `pyannote.audio` user base and eventually help me write grant proposals to make `pyannote.audio` even better. So, please fill gating forms as precisely as possible.

For instance, before gating `pyannote/speaker-diarization`, I had no idea that so many people were relying on it in production. Hint: sponsors are more than welcome! Maintaining open source libraries is time consuming.

That being said, this whole authentication process does not prevent you from using official `pyannote.audio` models offline (i.e. without going through the authentication process in every `docker run ...` or whatever you are using in production): see [this tutorial](tutorials/applying_a_model.ipynb) for models and [that one](tutorials/applying_a_pipeline.ipynb) for pipelines.

<a name="does-pyannote-support-streaming-speaker-diarization"></a>
## Does pyannote support streaming speaker diarization?

pyannote does not, but [diart](https://github.com/juanmc2005/diart) (which is based on pyannote) does.

<a name="how-can-i-improve-performance"></a>
## How can I improve performance?

**Short answer:** [pyannoteAI](https://www.pyannote.ai) precision models are usually much more accurate (and faster).

**Long answer:**

1. Manually annotate dozens of conversations as precisely as possible.
2. Separate them into train (80%), development (10%) and test (10%) subsets.
3. Setup the data for use with [`pyannote.database`](https://github.com/pyannote/pyannote-database#speaker-diarization).
4. Follow [this recipe](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/adapting_pretrained_pipeline.ipynb).
5. Enjoy.

<a name="how-does-one-spell-and-pronounce-pyannoteaudio"></a>
## How does one spell and pronounce pyannote.audio?

📝 Written in lower case: `pyannote.audio` (or `pyannote` if you are lazy). Not `PyAnnote` nor `PyAnnotate` (sic).
📢 Pronounced like the french verb `pianoter`. `pi` like in `pi`ano, not `py` like in `py`thon.
🎹 `pianoter` means to play the piano (hence the logo 🤯).

---

# Source: https://github.com/pyannote/pyannote-audio

# pyannote.audio Documentation

pyannote.audio is an open-source Python toolkit for speaker diarization, providing neural building blocks for identifying who spoke when in audio files.

## Overview

pyannote.audio is built on PyTorch and PyTorch Lightning, offering:

- State-of-the-art speaker diarization models
- Pretrained pipelines available on Hugging Face
- Support for voice activity detection (VAD), speaker change detection, and overlapped speech detection
- Multi-GPU training with PyTorch Lightning
- Flexible pipeline-based architecture for composing tasks
- Premium pyannoteAI cloud API for production deployments

## Key Features

- **Speaker Diarization**: Identify who spoke when in audio files
- **Voice Activity Detection (VAD)**: Detect speech regions in audio
- **Speaker Change Detection**: Identify points where speakers change
- **Overlapped Speech Detection**: Detect when multiple speakers talk simultaneously
- **Speaker Embedding**: Generate speaker representations for verification
- **Speaker Verification**: Identify speakers using voiceprints
- **Pretrained Models**: Download pretrained models from Hugging Face
- **Training Support**: Train custom models on your own data
- **Multi-GPU Training**: Leverage multiple GPUs with PyTorch Lightning

## Installation

```bash
# Install with pip
pip install pyannote.audio

# Install with uv (recommended)
uv add pyannote.audio
```

### Requirements

- Python 3.8+
- PyTorch 1.12+
- FFmpeg (for audio decoding via torchcodec)
- CUDA/GPU (recommended for inference, required for training)

## Quick Start

### Community Open-Source Model

```python
from pyannote.audio import Pipeline

# Load pretrained community-1 pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="HUGGINGFACE_TOKEN"
)

# Apply to audio file
output = pipeline("audio.wav")

# Print results
for turn, speaker in output.speaker_diarization:
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
```

### Premium pyannoteAI API

```python
from pyannote.audio import Pipeline

# Use premium precision-2 model via API
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-precision-2",
    token="PYANNOTEAI_API_KEY"
)

# Apply to audio file (runs on pyannoteAI servers)
output = pipeline("audio.wav")

for turn, speaker in output.speaker_diarization:
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s {speaker}")
```

## Available Models

### Speaker Diarization
- `pyannote/speaker-diarization-community-1` - Free, open-source model
- `pyannote/speaker-diarization-3.1` - Legacy v3.1 model
- `pyannote/speaker-diarization-precision-2` - Premium cloud-based API

### Voice Activity Detection
- `pyannote/voice-activity-detection` - Detect speech regions
- `pyannote/voice-activity-detection-v3` - Latest VAD model

### Speaker Segmentation
- Speaker segmentation models for custom pipelines
- Customizable for fine-tuning on domain-specific data

## Configuration Options

### Pipeline Parameters

```python
# Limit number of speakers
output = pipeline("audio.wav", num_speakers=2)

# Set speaker count bounds
output = pipeline("audio.wav", min_speakers=1, max_speakers=3)

# Use progress hook for long audio
from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```

### GPU Acceleration

```python
import torch
pipeline.to(torch.device("cuda"))  # Use GPU for inference
```

## Output Format

Speaker diarization output is a `DiarizationResult` object:

```python
for turn, speaker in output.speaker_diarization:
    print(f"Speaker {speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
    print(f"Duration: {turn.duration:.2f}s")

# Export to RTTM format
with open("output.rttm", "w") as rttm:
    output.write_rttm(rttm)
```

## Training Custom Models

### Basic Training

```python
from pyannote.audio.models import SpeakerDiarization

# Define model
model = SpeakerDiarization(encoder="wav2vec2-large-xlsr-53-english")

# Train on your data
from pytorch_lightning import Trainer
trainer = Trainer(max_epochs=100, gpus=1)
trainer.fit(model, train_loader, val_loader)
```

### Fine-tuning Pretrained Models

```python
# Load pretrained model
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/speaker-diarization-3.1")

# Fine-tune on your data
trainer = Trainer(max_epochs=50, gpus=1)
trainer.finetune(model, train_loader, val_loader)
```

## API Reference

### Pipeline

- `Pipeline.from_pretrained(model_id, token=None)` - Load pretrained pipeline
- `pipeline(audio_path, num_speakers=None, min_speakers=None, max_speakers=None)` - Process audio file
- `pipeline.to(device)` - Move pipeline to device (GPU/CPU)

### Output

- `output.speaker_diarization` - Iterator over (turn, speaker) tuples
- `output.write_rttm(file_handle)` - Export results to RTTM format
- `output.to_dataframe()` - Convert to pandas DataFrame

### Models

- `Model.from_pretrained(model_id, token=None)` - Load pretrained model
- `Model.custom(architecture, pretrained=False)` - Create custom model

## Benchmarks

As of September 2025, performance on standard benchmarks:

| Dataset | Legacy (3.1) | Community-1 | Precision-2 |
|---------|-------------|------------|------------|
| AISHELL-4 | 12.2% | 11.7% | 11.4% |
| AMI (IHM) | 18.8% | 17.0% | 12.9% |
| VoxConverse | 11.2% | 11.2% | 8.5% |
| DIHARD 3 | 21.4% | 20.2% | 14.7% |

Diarization Error Rate (DER) in %, lower is better.

## Telemetry

pyannote.audio tracks usage for research purposes:

```python
from pyannote.audio.telemetry import set_telemetry_metrics

# Enable metrics (default)
set_telemetry_metrics(True, save_choice_as_default=True)

# Disable metrics
set_telemetry_metrics(False, save_choice_as_default=True)
```

## Resources

- **Official Website**: https://pyannote.ai
- **GitHub Repository**: https://github.com/pyannote/pyannote-audio
- **Documentation**: https://docs.pyannote.ai (pyannoteAI)
- **Hugging Face Models**: https://huggingface.co/pyannote
- **Paper**: See GitHub repository for academic publications
- **Issues & Questions**: GitHub Issues for bug reports and feature requests

## Related Projects

- **pyannote.metrics**: Evaluation metrics for diarization
- **pyannote.core**: Core data structures (Annotation, Timeline, etc.)
- **pyannoteAI**: Premium cloud-based speaker diarization service

## License

MIT License - See LICENSE file in repository

## Citation

If you use pyannote.audio in published research, please cite the appropriate paper(s) from the GitHub repository.

## Support

- **FAQ**: See FAQ.md
- **GitHub Issues**: https://github.com/pyannote/pyannote-audio/issues
- **Discussion**: https://github.com/pyannote/pyannote-audio/discussions
- **Documentation**: https://docs.pyannote.ai (for premium API)