# Frequently Asked Questions
- [Can I apply pretrained pipelines on audio already loaded in memory?](#can-i-apply-pretrained-pipelines-on-audio-already-loaded-in-memory)
- [Can I use gated models (and pipelines) offline?](#can-i-use-gated-models-and-pipelines-offline)
- [Does pyannote support streaming speaker diarization?](#does-pyannote-support-streaming-speaker-diarization)
- [How can I improve performance?](#how-can-i-improve-performance)
- [How does one spell and pronounce pyannote.audio?](#how-does-one-spell-and-pronounce-pyannoteaudio)
## Can I apply pretrained pipelines on audio already loaded in memory?
Yes: read [this tutorial](tutorials/applying_a_pipeline.ipynb) until the end.
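In short, pipelines also accept a `{"waveform": ..., "sample_rate": ...}` mapping in place of a file path. A minimal sketch (torchaudio is used here purely for illustration; any loader that yields a `(channel, time)` tensor works):
```python
import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="HUGGINGFACE_TOKEN",
)

# Load the audio into memory: (channel, time) tensor plus its sample rate
waveform, sample_rate = torchaudio.load("audio.wav")

# Pass the in-memory audio instead of a file path
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```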
## Can I use gated models (and pipelines) offline?
**Short answer**: yes, see [this tutorial](tutorials/applying_a_model.ipynb) for models and [that one](tutorials/applying_a_pipeline.ipynb) for pipelines.
**Long answer**: gating models and pipelines allows [me](https://herve.niderb.fr) to learn a bit more about the `pyannote.audio` user base, which in turn helps me write grant proposals to make `pyannote.audio` even better. So, please fill in the gating forms as precisely as possible.
For instance, before gating `pyannote/speaker-diarization`, I had no idea that so many people were relying on it in production. Hint: sponsors are more than welcome! Maintaining open-source libraries is time-consuming.
That being said, this whole authentication process does not prevent you from using official `pyannote.audio` models offline (i.e. without going through the authentication process in every `docker run ...` or whatever you are using in production): see [this tutorial](tutorials/applying_a_model.ipynb) for models and [that one](tutorials/applying_a_pipeline.ipynb) for pipelines.
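As a rough sketch of that offline pattern (the local path below is hypothetical; the tutorials above are authoritative):
```python
from pyannote.audio import Pipeline

# One-time, online: authenticate once and download the pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="HUGGINGFACE_TOKEN",
)

# Later, offline (e.g. inside `docker run`): load from a local copy of the
# pipeline files -- the path below is hypothetical
pipeline = Pipeline.from_pretrained("/models/speaker-diarization/config.yaml")
```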
## Does pyannote support streaming speaker diarization?
pyannote does not, but [diart](https://github.com/juanmc2005/diart) (which is based on pyannote) does.
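For reference, a minimal streaming sketch adapted from the diart README; the class names (`SpeakerDiarization`, `MicrophoneAudioSource`, `StreamingInference`) belong to diart, and its API may have changed since, so treat this as an illustration only:
```python
from diart import SpeakerDiarization
from diart.sources import MicrophoneAudioSource
from diart.inference import StreamingInference

pipeline = SpeakerDiarization()        # streaming pipeline built on pyannote models
mic = MicrophoneAudioSource()          # live audio from the default microphone
inference = StreamingInference(pipeline, mic)
prediction = inference()               # runs until the stream ends
```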
## How can I improve performance?
**Short answer:** [pyannoteAI](https://www.pyannote.ai) precision models are usually much more accurate (and faster).
**Long answer:**
1. Manually annotate dozens of conversations as precisely as possible.
2. Separate them into train (80%), development (10%) and test (10%) subsets.
3. Set up the data for use with [`pyannote.database`](https://github.com/pyannote/pyannote-database#speaker-diarization) (a loading sketch follows this list).
4. Follow [this recipe](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/adapting_pretrained_pipeline.ipynb).
5. Enjoy.
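A sketch of step 3, assuming a `database.yml` that describes a hypothetical `MyDatabase.SpeakerDiarization.MyProtocol` protocol (see the pyannote.database README linked above for the exact file format):
```python
from pyannote.database import FileFinder, registry

# Register the protocols described in database.yml (hypothetical path)
registry.load_database("database.yml")

# train/development/test subsets become iterable methods of the protocol
protocol = registry.get_protocol(
    "MyDatabase.SpeakerDiarization.MyProtocol",
    preprocessors={"audio": FileFinder()},
)

for file in protocol.train():
    print(file["uri"], file["annotation"])
```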
## How does one spell and pronounce pyannote.audio?
📝 Written in lower case: `pyannote.audio` (or `pyannote` if you are lazy). Not `PyAnnote` nor `PyAnnotate` (sic).
📢 Pronounced like the French verb `pianoter`. `pi` like in `pi`ano, not `py` like in `py`thon.
🎹 `pianoter` means to play the piano (hence the logo 🤯).
---
# Source: https://github.com/pyannote/pyannote-audio
# pyannote.audio Documentation
pyannote.audio is an open-source Python toolkit for speaker diarization, providing neural building blocks for identifying who spoke when in audio files.
## Overview
pyannote.audio is built on PyTorch and PyTorch Lightning, offering:
- State-of-the-art speaker diarization models
- Pretrained pipelines available on Hugging Face
- Support for voice activity detection (VAD), speaker change detection, and overlapped speech detection
- Multi-GPU training with PyTorch Lightning
- Flexible pipeline-based architecture for composing tasks
- Premium pyannoteAI cloud API for production deployments
## Key Features
- **Speaker Diarization**: Identify who spoke when in audio files
- **Voice Activity Detection (VAD)**: Detect speech regions in audio
- **Speaker Change Detection**: Identify points where speakers change
- **Overlapped Speech Detection**: Detect when multiple speakers talk simultaneously
- **Speaker Embedding**: Generate speaker representations for verification
- **Speaker Verification**: Identify speakers using voiceprints (see the embedding sketch after this list)
- **Pretrained Models**: Download pretrained models from Hugging Face
- **Training Support**: Train custom models on your own data
- **Multi-GPU Training**: Leverage multiple GPUs with PyTorch Lightning
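For instance, speaker embedding and verification can be combined as below; a minimal sketch following the `pyannote/embedding` model card (the audio file names are placeholders):
```python
import numpy as np
from scipy.spatial.distance import cdist

from pyannote.audio import Inference, Model

# Extract one fixed-size voiceprint per recording
model = Model.from_pretrained("pyannote/embedding", use_auth_token="HUGGINGFACE_TOKEN")
inference = Inference(model, window="whole")

embedding1 = inference("speaker1.wav")
embedding2 = inference("speaker2.wav")

# Smaller cosine distance -> more likely the same speaker
distance = cdist(
    np.reshape(embedding1, (1, -1)),
    np.reshape(embedding2, (1, -1)),
    metric="cosine",
)[0, 0]
print(f"cosine distance: {distance:.3f}")
```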
## Installation
```bash
# Install with pip
pip install pyannote.audio
# Install with uv (recommended)
uv add pyannote.audio
```
### Requirements
- Python 3.8+
- PyTorch 1.12+
- FFmpeg (for audio decoding via torchcodec)
- CUDA/GPU (recommended for inference, required for training)
## Quick Start
### Community Open-Source Model
```python
from pyannote.audio import Pipeline

# Load pretrained community-1 pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="HUGGINGFACE_TOKEN",
)

# Apply to audio file
output = pipeline("audio.wav")

# Print results
for turn, speaker in output.speaker_diarization:
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
```
### Premium pyannoteAI API
```python
from pyannote.audio import Pipeline

# Use premium precision-2 model via API
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-precision-2",
    token="PYANNOTEAI_API_KEY",
)

# Apply to audio file (runs on pyannoteAI servers)
output = pipeline("audio.wav")

for turn, speaker in output.speaker_diarization:
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s {speaker}")
```
## Available Models
### Speaker Diarization
- `pyannote/speaker-diarization-community-1` - Free, open-source model
- `pyannote/speaker-diarization-3.1` - Legacy v3.1 model
- `pyannote/speaker-diarization-precision-2` - Premium cloud-based API
### Voice Activity Detection
- `pyannote/voice-activity-detection` - Detect speech regions
- `pyannote/voice-activity-detection-v3` - Latest VAD model
### Speaker Segmentation
- Speaker segmentation models for custom pipelines
- Customizable for fine-tuning on domain-specific data
## Configuration Options
### Pipeline Parameters
```python
# Limit number of speakers
output = pipeline("audio.wav", num_speakers=2)

# Set speaker count bounds
output = pipeline("audio.wav", min_speakers=1, max_speakers=3)

# Use progress hook for long audio
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```
### GPU Acceleration
```python
import torch
pipeline.to(torch.device("cuda")) # Use GPU for inference
```
## Output Format
Speaker diarization output is a `DiarizationResult` object:
```python
for turn, speaker in output.speaker_diarization:
    print(f"Speaker {speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
    print(f"Duration: {turn.duration:.2f}s")

# Export to RTTM format
with open("output.rttm", "w") as rttm:
    output.write_rttm(rttm)
```
## Training Custom Models
### Basic Training
```python
from pytorch_lightning import Trainer

from pyannote.audio.models import SpeakerDiarization

# Define model
model = SpeakerDiarization(encoder="wav2vec2-large-xlsr-53-english")

# Train on your data (Lightning >= 2.0 replaced `gpus=1` with the
# accelerator/devices pair)
trainer = Trainer(max_epochs=100, accelerator="gpu", devices=1)
trainer.fit(model, train_loader, val_loader)
```
### Fine-tuning Pretrained Models
```python
# Load pretrained model
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/speaker-diarization-3.1")

# Fine-tune on your data: Lightning's Trainer has no `finetune` method;
# calling `fit` on a pretrained model resumes training from its weights
trainer = Trainer(max_epochs=50, accelerator="gpu", devices=1)
trainer.fit(model, train_loader, val_loader)
```
## API Reference
### Pipeline
- `Pipeline.from_pretrained(model_id, token=None)` - Load pretrained pipeline
- `pipeline(audio_path, num_speakers=None, min_speakers=None, max_speakers=None)` - Process audio file
- `pipeline.to(device)` - Move pipeline to device (GPU/CPU)
### Output
- `output.speaker_diarization` - Iterator over (turn, speaker) tuples
- `output.write_rttm(file_handle)` - Export results to RTTM format
- `output.to_dataframe()` - Convert to pandas DataFrame
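A hedged example of the output accessors above, assuming `to_dataframe()` returns a pandas DataFrame as its name suggests:
```python
# Continuing from the quick-start example: tabulate and save speaker turns
df = output.to_dataframe()
df.to_csv("diarization.csv", index=False)
```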
### Models
- `Model.from_pretrained(model_id, token=None)` - Load pretrained model
- `Model.custom(architecture, pretrained=False)` - Create custom model
## Benchmarks
As of September 2025, performance on standard benchmarks:
| Dataset | Legacy (3.1) | Community-1 | Precision-2 |
|---------|-------------|------------|------------|
| AISHELL-4 | 12.2% | 11.7% | 11.4% |
| AMI (IHM) | 18.8% | 17.0% | 12.9% |
| VoxConverse | 11.2% | 11.2% | 8.5% |
| DIHARD 3 | 21.4% | 20.2% | 14.7% |
Diarization Error Rate (DER) in %, lower is better.
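Scores like these can be reproduced with the companion [`pyannote.metrics`](https://github.com/pyannote/pyannote-metrics) package; a minimal sketch on toy annotations:
```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy ground truth and system output
reference = Annotation()
reference[Segment(0, 10)] = "alice"
reference[Segment(10, 20)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0, 12)] = "spk1"
hypothesis[Segment(12, 20)] = "spk2"

# DER = (false alarm + missed detection + confusion) / total speech
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.1f}%")  # lower is better
```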
## Telemetry
pyannote.audio tracks usage for research purposes:
```python
from pyannote.audio.telemetry import set_telemetry_metrics
# Enable metrics (default)
set_telemetry_metrics(True, save_choice_as_default=True)
# Disable metrics
set_telemetry_metrics(False, save_choice_as_default=True)
```
## Resources
- **Official Website**: https://pyannote.ai
- **GitHub Repository**: https://github.com/pyannote/pyannote-audio
- **Documentation**: https://docs.pyannote.ai (pyannoteAI)
- **Hugging Face Models**: https://huggingface.co/pyannote
- **Paper**: See GitHub repository for academic publications
- **Issues & Questions**: GitHub Issues for bug reports and feature requests
## Related Projects
- **pyannote.metrics**: Evaluation metrics for diarization
- **pyannote.core**: Core data structures (Annotation, Timeline, etc.)
- **pyannoteAI**: Premium cloud-based speaker diarization service
## License
MIT License - See LICENSE file in repository
## Citation
If you use pyannote.audio in published research, please cite the appropriate paper(s) from the GitHub repository.
## Support
- **FAQ**: See FAQ.md
- **GitHub Issues**: https://github.com/pyannote/pyannote-audio/issues
- **Discussion**: https://github.com/pyannote/pyannote-audio/discussions
- **Documentation**: https://docs.pyannote.ai (for premium API)