# Huggingface Transformers

> We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity...

## Pages

- [Contributor Covenant Code of Conduct](code-of-conduct.md): We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience f...
- [Accelerate](accelerate.md): [Accelerate](https://hf.co/docs/accelerate/index) is a library designed to simplify distributed training on any type ...
- [Accelerator selection](accelerator-selection.md): During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use....
- [Legacy model contribution](add-new-model.md): [!TIP]
- [Adding a new pipeline](add-new-pipeline.md): Make [Pipeline](/docs/transformers/v4.57.3/en/main_classes/pipelines#transformers.Pipeline) your own by subclassing i...
- [AFMoE](afmoe.md): AFMoE (Arcee Foundational Mixture of Experts) is a decoder-only transformer model that extends the Llama architecture...
- [Agents](agents.md): (deprecated)
- [AIMv2](aimv2.md): The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://huggingface...
- [ALBERT](albert.md): [ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training ...
- [ALIGN](align.md): [ALIGN](https://huggingface.co/papers/2102.05918) is pretrained on a noisy 1.8 billion alt‑text and image pair datase...
- [AltCLIP](altclip.md): [AltCLIP](https://huggingface.co/papers/2211.06679) replaces the [CLIP](./clip) text encoder with a multilingual XLM-...
- [Multimodal Generation](any-to-any.md): Multimodal (any-to-any) models are language models capable of processing diverse types of input data (e.g., text, ima...
- [Apertus](apertus.md): [Apertus](https://www.swiss-ai.org) is a family of large language models from the Swiss AI Initiative.
- [AQLM](aqlm.md): Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) quantizes multiple weight...
- [Arcee](arcee.md): [Arcee](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model) is a decoder-only transforme...
- [Aria](aria.md): [Aria](https://huggingface.co/papers/2410.05993) is a multimodal mixture-of-experts (MoE) model. The goal of this mod...
- [Automatic speech recognition](asr.md): Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outpu...
- [Attention Interface](attention-interface.md): This page describes how to use the `AttentionInterface` in order to register custom attention functions to use with
- [Audio Spectrogram Transformer](audio-spectrogram-transformer.md): The Audio Spectrogram Transformer model was proposed in [AST: Audio Spectrogram Transformer](https://huggingface.co/p...
- [Audio classification](audio-classification.md): Audio classification - just like with text - assigns a class label as output from the input data. The only difference...
- [Utilities for `FeatureExtractors`](audio-utils.md): This page lists all the utility functions that can be used by the audio `FeatureExtractor` in order to compute specia...
- [Audio Flamingo 3](audioflamingo3.md): Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning ove...
- [Auto Classes](auto.md): In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
- [Documenting a model](auto-docstring.md): The `@auto_docstring` decorator in Transformers generates consistent docstrings for model classes and their methods. ...
- [AutoRound](auto-round.md): [AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy,...
- [Autoformer](autoformer.md): The Autoformer model was proposed in [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Seri...
- [AWQ](awq.md): [Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) preserves a small fraction of the weigh...
- [Axolotl](axolotl.md): [Axolotl](https://docs.axolotl.ai/) is a fine-tuning and post-training framework for large language models. It suppor...
- [Aya Vision](aya-vision.md): [Aya Vision](https://huggingface.co/papers/2505.08751) is a family of open-weight multimodal vision-language models f...
- [Backbones](backbones.md): Higher-level computer vision tasks, such as object detection or image segmentation, use several models together to g...
- [Bamba](bamba.md): [Bamba](https://huggingface.co/blog/bamba) is a 9B parameter decoder-only language model built on the [Mamba-2](./mam...
- [Bark](bark.md): [Bark](https://huggingface.co/suno/bark) is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/...
- [BART](bart.md): [BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining object...
- [BARThez](barthez.md): [BARThez](https://huggingface.co/papers/2010.12321) is a [BART](./bart) model designed for French language tasks. Unl...
- [BARTpho](bartpho.md): [BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model. It offers...
- [BEiT](beit.md): The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.082...
- [BertGeneration](bert-generation.md): [BertGeneration](https://huggingface.co/papers/1907.12461) leverages pretrained BERT checkpoints for sequence-to-sequ...
- [BertJapanese](bert-japanese.md): The BERT models trained on Japanese text.
- [BERT](bert.md): [BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predi...
- [BERTweet](bertweet.md): [BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pr...
- [BigBird](big-bird.md): [BigBird](https://huggingface.co/papers/2007.14062) is a transformer model built to handle sequence lengths up to 409...
- [BigBirdPegasus](bigbird-pegasus.md): [BigBirdPegasus](https://huggingface.co/papers/2007.14062) is an encoder-decoder (sequence-to-sequence) transformer m...
- [BioGPT](biogpt.md): [BioGPT](https://huggingface.co/papers/2210.10341) is a generative Transformer model based on [GPT-2](./gpt2) and pre...
- [Big Transfer (BiT)](bit.md): The BiT model was proposed in [Big Transfer (BiT): General Visual Representation Learning](https://huggingface.co/pap...
- [BitNet](bitnet.md): [BitNet](https://huggingface.co/papers/2402.17764) replaces traditional linear layers in Multi-Head Attention and fee...
- [Bitsandbytes](bitsandbytes.md): The [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library provides quantization tools for L...
- [Blenderbot Small](blenderbot-small.md): Note that [BlenderbotSmallModel](/docs/transformers/v5.0.0/en/model_doc/blenderbot-small#transformers.BlenderbotSmall...
- [Blenderbot](blenderbot.md): The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://huggingface.co/papers...
- [BLIP-2](blip-2.md): The BLIP-2 model was proposed in [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and La...
- [BLIP](blip.md): [BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretr...
- [BLOOM](bloom.md): The [BLOOM](https://huggingface.co/papers/2211.05100) model has been proposed with its various versions through the [...
- [Byte Latent Transformer (BLT)](blt.md): The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](https://huggingface.co/pape...
- [BridgeTower](bridgetower.md): The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representati...
- [BROS](bros.md): The BROS model was proposed in [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Informa...
- [ByT5](byt5.md): [ByT5](https://huggingface.co/papers/2105.13626) is a tokenizer-free version of the [T5](./t5) model designed to work...
- [Caching](cache-explanation.md): Imagine you're having a conversation with someone, and instead of remembering what they previously said, they have to...
- [Callbacks](callback.md): Callbacks are objects that can customize the behavior of the training loop in the PyTorch
- [CamemBERT](camembert.md): [CamemBERT](https://huggingface.co/papers/1911.03894) is a language model based on [RoBERTa](./roberta), but trained ...
- [CANINE](canine.md): [CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of spl...
- [Chameleon](chameleon.md): The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/pa...
- [Tool use](chat-extras.md): Chat models are commonly trained with support for "function-calling" or "tool-use". Tools are functions supplied by t...
- [Chat templates](chat-templating.md): The [chat basics](./conversations) guide covers how to store chat histories and generate text from chat models using ... (see the chat-template sketch at the end of this page)
- [Multimodal chat templates](chat-templating-multimodal.md): Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimo...
- [Writing a chat template](chat-templating-writing.md): A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer...
- [Chinese-CLIP](chinese-clip.md): The Chinese-CLIP model was proposed in [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://hug...
- [CLAP](clap.md): [CLAP (Contrastive Language-Audio Pretraining)](https://huggingface.co/papers/2211.06687) is a multimodal model that ...
- [CLIP](clip.md): [CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcomi...
- [CLIPSeg](clipseg.md): The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://huggingface.co/papers/211...
- [CLVP](clvp.md): The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through s...
- [CodeLlama](code-llama.md): [Code Llama](https://huggingface.co/papers/2308.12950) is a specialized family of large language models based on [Lla...
- [CodeGen](codegen.md): The CodeGen model was proposed in [A Conversational Paradigm for Program Synthesis](https://huggingface.co/papers/220...
- [Cohere](cohere.md): Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed f...
- [Cohere 2](cohere2.md): [Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B paramet...
- [Command A Vision](cohere2-vision.md): Command A Vision ([blog post](https://cohere.com/blog/command-a-vision)) is a state-of-the-art multimodal model desig...
- [ColPali](colpali.md): [ColPali](https://huggingface.co/papers/2407.01449) is a model designed to retrieve documents by analyzing their visu...
- [ColQwen2](colqwen2.md): [ColQwen2](https://huggingface.co/papers/2407.01449) is a variant of the [ColPali](./colpali) model designed to retri...
- [Community](community.md): This page regroups resources around 🤗 Transformers developed by the community.
- [compressed-tensors](compressed-tensors.md): [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) extends [safetensors](https://github.com/hugg...
- [Quantization concepts](concept-guide.md): Quantization reduces the memory footprint and computational cost of large machine learning models like those found in...
- [Conditional DETR](conditional-detr.md): The Conditional DETR model was proposed in [Conditional DETR for Fast Training Convergence](https://huggingface.co/pa...
- [Configuration](configuration.md): The base class [PreTrainedConfig](/docs/transformers/v5.0.0/en/main_classes/configuration#transformers.PreTrainedConf...
- [Continuous batching](continuous-batching.md): Continuous batching maximizes GPU utilization. It increases throughput and reduces latency by using dynamic schedulin...
- [Contribute](contribute.md): Transformers supports many quantization methods such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are still many...
- [Contribute to 🤗 Transformers](contributing.md): Everyone is welcome to contribute, and we value everybody's contribution. Code
- [ConvBERT](convbert.md): The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://huggingface...
- [Chat basics](conversations.md): Chat models are conversational models you can send a message to and receive a response. Most language models from mid...
- [ConvNeXT](convnext.md): The ConvNeXT model was proposed in [A ConvNet for the 2020s](https://huggingface.co/papers/2201.03545) by Zhuang Liu,...
- [ConvNeXt V2](convnextv2.md): The ConvNeXt V2 model was proposed in [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https...
- [CPM](cpm.md): The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface....
- [CPMAnt](cpmant.md): CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first mileston...
- [Csm](csm.md): The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](h...
- [CTRL](ctrl.md): The CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://huggi...
- [Using Cursor as a client of transformers serve](cursor.md): This example shows how to use `transformers serve` as a local LLM provider for [Cursor](https://cursor.com/), the pop...
- [Customizing models](custom-models.md): Transformers models are designed to be customizable. A model's code is fully contained in the [model](https://github.c...
- [Convolutional Vision Transformer (CvT)](cvt.md): [Convolutional Vision Transformer (CvT)](https://huggingface.co/papers/2103.15808) is a model that combines the stren...
- [Code World Model (CWM)](cwm.md): The Code World Model (CWM) was proposed in [CWM: An Open-Weights LLM for Research on Code
- [D-FINE](d-fine.md): The D-FINE model was proposed in [D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement](...
- [DAB-DETR](dab-detr.md): The DAB-DETR model was proposed in [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://huggingface.c...
- [DAC](dac.md): The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://hu...
- [Data2Vec](data2vec.md): The Data2Vec model was proposed in [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and ...
- [Data Collator](data-collator.md): Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of
- [DBRX](dbrx.md): DBRX is a [transformer-based](https://www.isattentionallyouneed.com/) decoder-only large language model (LLM) that wa...
- [DeBERTa-v2](deberta-v2.md): [DeBERTa-v2](https://huggingface.co/papers/2006.03654) improves on the original [DeBERTa](./deberta) architecture by ...
- [DeBERTa](deberta.md): [DeBERTa](https://huggingface.co/papers/2006.03654) improves the pretraining efficiency of BERT and RoBERTa with two ...
- [Multi-GPU debugging](debugging.md): Distributed training can be tricky because you have to ensure you're using the correct CUDA version across your syste...
- [Decision Transformer](decision-transformer.md): The Decision Transformer model was proposed in [Decision Transformer: Reinforcement Learning via Sequence Modeling](h...
- [DeepSeek-V2](deepseek-v2.md): The DeepSeek-V2 model was proposed in [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language M...
- [DeepSeek-V3](deepseek-v3.md): The DeepSeek-V3 model was proposed in [DeepSeek-V3 Technical Report](https://huggingface.co/papers/2412.19437) by Dee...
- [DeepseekVL](deepseek-vl.md): [Deepseek-VL](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-langua...
- [DeepseekVLHybrid](deepseek-vl-hybrid.md): [Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision...
- [DeepSpeed](deepspeed.md): [DeepSpeed](https://www.deepspeed.ai/) is designed to optimize distributed training for large models with data, model...
- [Deformable DETR](deformable-detr.md): [Deformable DETR](https://huggingface.co/papers/2010.04159) improves on the original [DETR](./detr) by using a deform...
- [DeiT](deit.md): The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://...
- [DePlot](deplot.md): DePlot was proposed in the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://hu...
- [Depth Anything](depth-anything.md): [Depth Anything](https://huggingface.co/papers/2401.10891) is designed to be a foundation model for monocular depth e...
- [Depth Anything V2](depth-anything-v2.md): Depth Anything V2 was introduced in [the paper of the same name](https://huggingface.co/papers/2406.09414) by Lihe Ya...
- [DepthPro](depth-pro.md): The DepthPro model was proposed in [Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://huggingfac...
- [DETR](detr.md): [DETR](https://huggingface.co/papers/2005.12872) consists of a convolutional backbone followed by an encoder-decoder ...
- [Dia](dia.md): [Dia](https://github.com/nari-labs/dia) is an open-source text-to-speech (TTS) model (1.6B parameters) developed by [...
- [DialoGPT](dialogpt.md): DialoGPT was proposed in [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https...
- [DiffLlama](diffllama.md): The DiffLlama model was proposed in [Differential Transformer](https://huggingface.co/papers/2410.05258) by Kazuma Ma...
- [Dilated Neighborhood Attention Transformer](dinat.md): DiNAT was proposed in [Dilated Neighborhood Attention Transformer](https://huggingface.co/papers/2209.15001)
- [DINOv2](dinov2.md): [DINOv2](https://huggingface.co/papers/2304.07193) is a vision foundation model that uses [ViT](./vit) as a feature e...
- [DINOv2 with Registers](dinov2-with-registers.md): The DINOv2 with Registers model was proposed in [Vision Transformers Need Registers](https://huggingface.co/papers/23...
- [DINOv3](dinov3.md): [DINOv3](https://huggingface.co/papers/2508.10104) is a family of versatile vision foundation models that outperforms...
- [DistilBERT](distilbert.md): [DistilBERT](https://huggingface.co/papers/1910.01108) is pretrained by knowledge distillation to create a smaller mo...
- [DiT](dit.md): [DiT](https://huggingface.co/papers/2203.02378) is an image transformer pretrained on large-scale unlabeled document ...
- [Document Question Answering](document-question-answering.md): Document Question Answering, also referred to as Document Visual Question Answering, is a task that involves providing
- [Doge](doge.md): Doge is a series of small language models based on the [Doge](https://github.com/SmallDoges/small-doge) architecture,...
- [Donut](donut.md): [Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understan...
- [dots.llm1](dots1.md): The `dots.llm1` model was proposed in [dots.llm1 technical report](https://huggingface.co/papers/2506.05767) by redno...
- [DPR](dpr.md): Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
- [DPT](dpt.md): The DPT model was proposed in [Vision Transformers for Dense Prediction](https://huggingface.co/papers/2103.13413) by...
- [EdgeTAM](edgetam.md): The EdgeTAM model was proposed in [EdgeTAM: On-Device Track Anything Model](https://huggingface.co/papers/2501.07256)...
- [EdgeTAMVideo](edgetam-video.md): The EdgeTAM model was proposed in [EdgeTAM: On-Device Track Anything Model](https://huggingface.co/papers/2501.07256)...
- [EETQ](eetq.md): The [Easy & Efficient Quantization for Transformers (EETQ)](https://github.com/NetEase-FuXi/EETQ) library supports in...
- [EfficientLoFTR](efficientloftr.md): [EfficientLoFTR](https://huggingface.co/papers/2403.04765) is an efficient detector-free local feature matching metho...
- [EfficientNet](efficientnet.md): The EfficientNet model was proposed in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](htt...
- [ELECTRA](electra.md): [ELECTRA](https://huggingface.co/papers/2003.10555) modifies the pretraining objective of traditional masked language...
- [Emu3](emu3.md): The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](https://huggingface.co/papers/2409.18869...
- [EnCodec](encodec.md): The EnCodec neural codec model was proposed in [High Fidelity Neural Audio Compression](https://huggingface.co/papers...
- [Encoder Decoder Models](encoder-decoder.md): [`EncoderDecoderModel`](https://huggingface.co/papers/1706.03762) initializes a sequence-to-sequence model with any p...
- [Environment Variables](environment-variables.md): By default, this option is disabled. When enabled, it allows Torch and Safetensors weight files to be loaded in paral...
- [EoMT](eomt.md): [The Encoder-only Mask Transformer](https://www.tue-mps.org/eomt) (EoMT) model was introduced in the CVPR 2025 High...
- [ERNIE](ernie.md): [ERNIE1.0](https://huggingface.co/papers/1904.09223), [ERNIE2.0](https://ojs.aaai.org/index.php/AAAI/article/view/6428),
- [Ernie 4.5](ernie4-5.md): The Ernie 4.5 model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) releas...
- [Ernie 4.5 Moe](ernie4-5-moe.md): The Ernie 4.5 Moe model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) re...
- [Ernie 4.5 VL MoE](ernie4-5-vl-moe.md): The Ernie 4.5 VL MoE model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/)...
- [ESM](esm.md): This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental
- [Evolla](evolla.md): The Evolla model was proposed in [Decoding the Molecular Language of Proteins with Evolla](https://doi.org/10.1101/20...
- [EXAONE 4](exaone4.md): **[EXAONE 4.0](https://github.com/LG-AI-EXAONE/EXAONE-4.0)** is a language model that integrates a **Non-re...
- [ExecuTorch](executorch.md): [ExecuTorch](https://pytorch.org/executorch/stable/index.html) is a platform that enables PyTorch training and infere...
- [Falcon](falcon.md): [Falcon](https://huggingface.co/papers/2311.16867) is a family of large language models, available in 7B, 40B, and 18...
- [Falcon3](falcon3.md): [Falcon3](https://falconllm.tii.ae/falcon3/index.html) represents a natural evolution from previous releases, emphasi...
- [FalconH1](falcon-h1.md): The [FalconH1](https://huggingface.co/blog/tiiuae/falcon-h1) model was developed by the TII Pretraining team. A compr...
- [FalconMamba](falcon-mamba.md): [FalconMamba](https://huggingface.co/papers/2410.05355) is a 7B large language model, available as pretrained and ins...
- [Tokenizers](fast-tokenizers.md): Tokenizers convert text into an array of numbers known as tensors, the inputs to a text model. There are several toke...
- [FastVLM](fast-vlm.md): FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging repara...
- [FastSpeech2Conformer](fastspeech2-conformer.md): The FastSpeech2Conformer model was proposed with the paper [Recent Developments On Espnet Toolkit Boosted By Conforme...
- [FBGEMM](fbgemm-fp8.md): [FBGEMM (Facebook GEneral Matrix Multiplication)](https://github.com/pytorch/FBGEMM) is a low-precision matrix multip...
- [Feature Extractor](feature-extractor.md): A feature extractor is in charge of preparing input features for audio models. This includes feature extraction from ...
- [Feature extractors](feature-extractors.md): Feature extractors preprocess audio data into the correct format for a given model. They take the raw audio signal and...
- [General Utilities](file-utils.md): This page lists all of Transformers' general utility functions that are found in the file `utils.py`.
- [Fine-grained FP8](finegrained-fp8.md): Fine-grained FP8 quantization quantizes the weights and activations to fp8.
- [FLAN-T5](flan-t5.md): FLAN-T5 was released in the paper [Scaling Instruction-Finetuned Language Models](https://huggingface.co/papers/2210....
- [FLAN-UL2](flan-ul2.md): [Flan-UL2](https://www.yitay.net/blog/flan-ul2-20b) is an encoder-decoder model based on the T5 architecture. It uses...
- [FlauBERT](flaubert.md): The FlauBERT model was proposed in the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://...
- [FLAVA](flava.md): The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://huggingface.co/pa...
- [FlexOlmo](flex-olmo.md): [FlexOlmo](https://huggingface.co/papers/2507.07024) is a new class of language models (LMs) that supports (1) distri...
- [Florence-2](florence2.md): [Florence-2](https://huggingface.co/papers/2311.06242) is an advanced vision foundation model that uses a prompt-base...
- [FNet](fnet.md): The FNet model was proposed in [FNet: Mixing Tokens with Fourier Transforms](https://huggingface.co/papers/2105.03824...
- [FocalNet](focalnet.md): The FocalNet model was proposed in [Focal Modulation Networks](https://huggingface.co/papers/2203.11926) by Jianwei Y...
- [FP-Quant](fp-quant.md): [FP-Quant](https://github.com/IST-DASLab/FP-Quant) is a family of quantization algorithms tailored for the Blackwell ...
- [FullyShardedDataParallel](fsdp.md): [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) i...
- [FSMT](fsmt.md): FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR's WMT19 News Translation Task Submission](...
- [Funnel Transformer](funnel.md): The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for
- [Fuyu](fuyu.md): The Fuyu model was created by [ADEPT](https://www.adept.ai/blog/fuyu-8b), and authored by Rohan Bavishi, Erich Elsen,...
- [Gemma](gemma.md): [Gemma](https://huggingface.co/papers/2403.08295) is a family of lightweight language models with pretrained and inst...
- [Gemma2](gemma2.md): [Gemma 2](https://huggingface.co/papers/2408.00118) is a family of language models with pretrained and instruction-tu...
- [Gemma 3](gemma3.md): [Gemma 3](https://huggingface.co/papers/2503.19786) is a multimodal model with pretrained and instruction-tuned varia...
- [Gemma3n](gemma3n.md): [Gemma3n](https://developers.googleblog.com/en/introducing-gemma-3n/) is a multimodal model with pretrained and instr...
- [Generation features](generation-features.md): The [generate()](/docs/transformers/v4.57.3/en/main_classes/text_generation#transformers.GenerationMixin.generate) AP...
- [Generation strategies](generation-strategies.md): A decoding strategy informs how a model should select the next generated token. There are many types of decoding stra...
- [Utilities for Generation](generation-utils.md): This page lists all the utility functions used by [generate()](/docs/transformers/v5.0.0/en/main_classes/text_generat...
- [GGUF](gguf.md): [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) is a file format used to store models for inferenc...
- [GIT](git.md): The GIT model was proposed in [GIT: A Generative Image-to-text Transformer for Vision and Language](https://huggingfa...
- [GLM-4](glm.md): The GLM Model was proposed
- [GLM-4-0414](glm4.md): The GLM family welcomes the new [GLM-4-0414](https://huggingface.co/papers/2406.12793) series of models.
- [GLM-4.6V](glm46v.md): The GLM-V model was proposed in [GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable...
- [Glm4Moe](glm4-moe.md): Both **GLM-4.6** and **GLM-4.5** language models use this class. The implementation in transformers does not include a...
- [GLM-4.7-Flash](glm4-moe-lite.md): GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.
- [GLM-V](glm4v.md): The GLM-V model was proposed in [GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable...
- [Glm4vMoe](glm4v-moe.md): The GLM-V model was proposed in [GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable...
- [GlmAsr](glmasr.md): **GLM-ASR-Nano-2512** is a robust, open-source speech recognition model with **1.5B parameters**. Designed for
- [Glossary](glossary.md): This glossary defines general machine learning and 🤗 Transformers terms to help you better understand the
- [GLPN](glpn.md): This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
- [GOT-OCR2](got-ocr2.md): The GOT-OCR2 model was proposed in [General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model](https://huggi...
- [GPT-Sw3](gpt-sw3.md): The GPT-Sw3 model was first proposed in
- [GPT-2](gpt2.md): [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is a s...
- [GPTBigCode](gpt-bigcode.md): The GPTBigCode model was proposed in [SantaCoder: don't reach for the stars!](https://huggingface.co/papers/2301.0398...
- [Gpt_Neo](gpt-neo.md): You can find all the original GPT-Neo checkpoints under the [EleutherAI](https://huggingface.co/EleutherAI?search_mod...
- [GPT-NeoX](gpt-neox.md): We introduce [GPT-NeoX-20B](https://huggingface.co/papers/2204.06745), a 20 billion parameter autoregressive language...
- [GPT-NeoX-Japanese](gpt-neox-japanese.md): GPT-NeoX-Japanese, a Japanese language model based on [GPT-NeoX](./gpt_neox).
- [GptOss](gpt-oss.md): The GptOss model was proposed in this [blog post](https://openai.com/index/introducing-gpt-oss/).
- [GPT-J](gptj.md): The [GPT-J](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) model was released in the [kingoflolz/mesh-trans...
- [GPTQ](gptq.md): The [GPT-QModel](https://github.com/ModelCloud/GPTQModel) project (Python package `gptqmodel`) implements the GPTQ al...
- [Granite](granite.md): [Granite](https://huggingface.co/papers/2408.13359) is a 3B parameter language model trained with the Power scheduler...
- [Granite Speech](granite-speech.md): The [Granite Speech](https://huggingface.co/papers/2505.08699) model ([blog post](https://www.ibm.com/new/announcemen...
- [GraniteMoe](granitemoe.md): The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler...
- [GraniteMoeHybrid](granitemoehybrid.md): The [GraniteMoeHybrid](https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek) model builds on...
- [GraniteMoeShared](granitemoeshared.md): The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler...
- [Granite Vision](granitevision.md): The [Granite Vision](https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more...
- [Grounding DINO](grounding-dino.md): The Grounding DINO model was proposed in [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Objec...
- [GroupViT](groupvit.md): The GroupViT model was proposed in [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://huggingfac...
- [Helium](helium.md): Helium was proposed in [Announcing Helium-1 Preview](https://kyutai.org/2025/01/13/helium.html) by the Kyutai Team.
- [HerBERT](herbert.md): The HerBERT model was proposed in [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://huggingfa...
- [HGNet-V2](hgnet-v2.md): [HGNetV2](https://github.com/PaddlePaddle/PaddleClas/blob/v2.6.0/docs/zh_CN/models/ImageNet1k/PP-HGNetV2.md) is a nex...
- [Hiera](hiera.md): Hiera was proposed in [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://huggingface.c...
- [HIGGS](higgs.md): [HIGGS](https://huggingface.co/papers/2411.17525) is a zero-shot quantization algorithm that combines Hadamard prepro...
- [Customizing model components](how-to-hack-models.md): Another way to customize a model is to modify its components, rather than writing a new model entirely, allowing yo...
- [How to Fine-Tune GPT-J - The Basics](howto-finetune.md): Before anything else, you'll likely want to apply for access to the TPU Research Cloud (TRC). Combined with a Google ...
- [Hyperparameter search](hpo-train.md): Hyperparameter search discovers an optimal set of hyperparameters that produces the best model performance. [Trainer]...
- [HQQ](hqq.md): [Half-Quadratic Quantization (HQQ)](https://github.com/mobiusml/hqq/) supports fast on-the-fly quantization for 8, 4,...
- [HuBERT](hubert.md): [HuBERT](https://huggingface.co/papers/2106.07447) is a self-supervised speech model to cluster aligned target labels...
- [HunYuanDenseV1](hunyuan-v1-dense.md): To be released with the official model launch.
- [HunYuanMoEV1](hunyuan-v1-moe.md): To be released with the official model launch.
- [I-BERT](ibert.md): The I-BERT model was proposed in [I-BERT: Integer-only BERT Quantization](https://huggingface.co/papers/2101.01321) by
- [Image tasks with IDEFICS](idefics.md): While individual tasks can be tackled by fine-tuning specialized models, an alternative approach
- [Idefics2](idefics2.md): The Idefics2 model was proposed in [What matters when building vision-language models?](https://huggingface.co/papers...
- [Idefics3](idefics3.md): The Idefics3 model was proposed in [Building and better understanding vision-language models: insights and future dir...
- [I-JEPA](ijepa.md): [I-JEPA](https://huggingface.co/papers/2301.08243) is a self-supervised learning method that learns semantic image re...
- [Image captioning](image-captioning.md): Image captioning is the task of predicting a caption for a given image. Common real world applications of it include
- [Image classification](image-classification.md): Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the
- [Image Feature Extraction](image-feature-extraction.md): Image feature extraction is the task of extracting semantically meaningful features given an image. This has many use...
- [Utilities for Image Processors](image-processing-utils.md): This page lists all the utility functions used by the image processors, mainly the functional
- [Image Processor](image-processor.md): An image processor is in charge of loading images (optionally), preparing input features for vision models and post p...
- [Image processors](image-processors.md): Image processors convert images into pixel values, tensors that represent image colors and size. The pixel values ar...
- [Image-text-to-text](image-text-to-text.md): Image-text-to-text models, also known as vision language models (VLMs), are language models that take an image input....
- [Image-to-Image Task Guide](image-to-image.md): Image-to-image is the task where an application receives an image and outputs another image. This has various su...
- [ImageGPT](imagegpt.md): The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
- [Import Utilities](import-utils.md): This page goes through the transformers utilities to enable lazy and fast object import.
- [Transformers](index.md): Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
- [Informer](informer.md): The Informer model was proposed in [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]...
- [Installation](installation.md): Transformers works with [PyTorch](https://pytorch.org/get-started/locally/). It has been tested on Python 3.9+ and Py...
- [InstructBLIP](instructblip.md): The InstructBLIP model was proposed in [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction...
- [InstructBlipVideo](instructblipvideo.md): The InstructBLIPVideo is an extension of the models proposed in [InstructBLIP: Towards General-purpose Vision-Languag...
- [InternVL](internvl.md): The InternVL3 family of Visual Language Models was introduced in [InternVL3: Exploring Advanced Training and Test-Tim...
- [Jais2](jais2.md): Jais2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the gr...
- [Jamba](jamba.md): [Jamba](https://huggingface.co/papers/2403.19887) is a hybrid Transformer-Mamba mixture-of-experts (MoE) language mod...
- [Jan: using the serving API as a local LLM provider](jan.md): This example shows how to use `transformers serve` as a local LLM provider for the [Jan](https://jan.ai/) app. Jan is...
- [Janus](janus.md): The Janus Model was originally proposed in [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding an...
- [JetMoe](jetmoe.md): **JetMoe-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com....
- [Kernels](kernels.md): This page documents the kernels configuration utilities.
- [Keypoint Detection](keypoint-detection.md): Keypoint detection identifies and locates specific points of interest within an image. These keypoints, also known as...
- [Keypoint matching](keypoint-matching.md): Keypoint matching matches different points of interest that belong to the same object appearing in two different images....
- [Knowledge Distillation for Computer Vision](knowledge-distillation-for-image-classification.md): Knowledge distillation is a technique used to transfer knowledge from a larger, more complex model (teacher) to a sma...
- [KOSMOS-2](kosmos-2.md): The KOSMOS-2 model was proposed in [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://huggin...
- [KOSMOS-2.5](kosmos2-5.md): The Kosmos-2.5 model was proposed in [KOSMOS-2.5: A Multimodal Literate Model](https://huggingface.co/papers/2309.114...
- [KV cache strategies](kv-cache.md): The key-value (KV) vectors are used to calculate attention scores. For autoregressive models, KV scores are calculate...
- [Kyutai Speech-To-Text](kyutai-speech-to-text.md): [Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://hu...
- [Causal language modeling](language-modeling.md): There are two types of language modeling, causal and masked. This guide illustrates causal language modeling.
- [LASR](lasr.md): TODO
- [LayoutLM](layoutlm.md): [LayoutLM](https://huggingface.co/papers/1912.13318) jointly learns text and the document layout rather than focusing...
- [LayoutLMV2](layoutlmv2.md): The LayoutLMV2 model was proposed in [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](...
- [LayoutLMv3](layoutlmv3.md): The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](h...
- [LayoutXLM](layoutxlm.md): LayoutXLM was proposed in [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](...
- [LED](led.md): [Longformer-Encoder-Decoder (LED)](https://huggingface.co/papers/2004.05150) is an encoder-decoder transformer model...
- [LeViT](levit.md): The LeViT model was proposed in [LeViT: Introducing Convolutions to Vision Transformers](https://huggingface.co/paper...
- [LFM2](lfm2.md): [LFM2](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models) represents a...
- [Lfm2Moe](lfm2-moe.md): LFM2-MoE is a Mixture-of-Experts (MoE) variant of [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d7219270...
- [LFM2-VL](lfm2-vl.md): [LFM2-VL](https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models) is the first series of vision-language founda...
- [LightGlue](lightglue.md): [LightGlue](https://huggingface.co/papers/2306.13643) is a deep neural network that learns to match local features ac...
- [LightOnOcr](lighton-ocr.md): **LightOnOcr** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document un...
- [LiLT](lilt.md): The LiLT model was proposed in [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured D...
- [Llama](llama.md): [Llama](https://huggingface.co/papers/2302.13971) is a family of large language models ranging from 7B to 65B paramet...
- [Llama 2](llama2.md): [Llama 2](https://huggingface.co/papers/2307.09288) is a family of large language models, Llama 2 and Llama 2-Chat, a...
- [Llama3](llama3.md): Llama 3 is a family of large language models from Meta.
- [Llama4](llama4.md): [Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regres...
- [llama.cpp](llama-cpp.md): [llama.cpp](https://github.com/ggml-org/llama.cpp) is a C/C++ inference engine for deploying large language models lo...
- [LLaVa](llava.md): LLaVa is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following...
- [LLaVA-NeXT](llava-next.md): [LLaVA‑NeXT](https://llava-vl.github.io/blog/2024-01-30-llava-next/) improves on [Llava](./llava) by increasing the i...
- [LLaVa-NeXT-Video](llava-next-video.md): The LLaVa-NeXT-Video model was proposed in [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-v...
- [LLaVA-OneVision](llava-onevision.md): The LLaVA-OneVision model was proposed in [LLaVA-OneVision: Easy Visual Task Transfer](https://huggingface.co/papers/...
- [Optimizing inference](llm-optims.md): Inference with large language models (LLMs) can be challenging because they have to store and handle billions of para...
- [Text generation](llm-tutorial.md): Text generation is the most popular application for large language models (LLMs). An LLM is trained to generate the ne...
- [Optimizing LLMs for Speed and Memory](llm-tutorial-optimization.md): Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://...
- [Loading kernels](loading-kernels.md): A kernel works as a drop-in replacement for standard PyTorch operations. It swaps the `forward` method with the optim...
- [Logging](logging.md): 🤗 Transformers has a centralized logging system, so that you can set up the verbosity of the library easily.
- [LongCatFlash](longcat-flash.md): The LongCatFlash model was proposed in [LongCat-Flash Technical Report](https://huggingface.co/papers/2509.01322) by ...
- [Longformer](longformer.md): [Longformer](https://huggingface.co/papers/2004.05150) is a transformer model designed for processing long documents....
- [LongT5](longt5.md): The LongT5 model was proposed in [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://huggingface....
- [LUKE](luke.md): The LUKE model was proposed in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](ht...
- [LW-DETR](lw-detr.md): [LW-DETR](https://huggingface.co/papers/2407.17140) proposes a light-weight Detection Transformer (DETR) architecture...
- [LXMERT](lxmert.md): The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://...
- [M2M100](m2m-100.md): The M2M100 model was proposed in [Beyond English-Centric Multilingual Machine Translation](https://huggingface.co/pap...
- [MADLAD-400](madlad-400.md): MADLAD-400 models were released in the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](ht...
- [Mamba](mamba.md): [Mamba](https://huggingface.co/papers/2312.00752) is a selective structured state space model (SSM) designed to work...
- [Mamba 2](mamba2.md): [Mamba 2](https://huggingface.co/papers/2405.21060) is based on the state space duality (SSD) framework which connect...
- [MarianMT](marian.md): [MarianMT](https://huggingface.co/papers/1804.00344) is a machine translation model trained with the Marian framework...
- [MarkupLM](markuplm.md): The MarkupLM model was proposed in [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document
- [Mask2Former](mask2former.md): The Mask2Former model was proposed in [Masked-attention Mask Transformer for Universal Image Segmentation](https://hu...
- [Mask Generation](mask-generation.md): Mask generation is the task of generating semantically meaningful masks for an image.
- [Masked language modeling](masked-language-modeling.md): Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This
- [MaskFormer](maskformer.md): This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
- [MatCha](matcha.md): MatCha has been proposed in the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart De...
- [mBART](mbart.md): [mBART](https://huggingface.co/papers/2001.08210) is a multilingual machine translation model that pretrains the enti...
- [MegatronBERT](megatron-bert.md): The MegatronBERT model was proposed in [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model
- [MegatronGPT2](megatron-gpt2.md): The MegatronGPT2 model was proposed in [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model
- [MetaCLIP 2](metaclip-2.md): MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA)...
- [MGP-STR](mgp-str.md): The MGP-STR model was proposed in [Multi-Granularity Prediction for Scene Text Recognition](https://huggingface.co/pa...
- [Mimi](mimi.md): [Mimi](https://huggingface.co/papers/2410.00037) is a neural audio codec model with pretrained and quantized variants, design...
- [MiniMax](minimax.md): [MiniMax-M2](https://huggingface.co/docs/transformers/en/model_doc/minimax_m2) was released on 2025‑10‑27. We recomme...
- [MiniMax-M2](minimax-m2.md): MiniMax-M2 is a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active para...
- [Ministral](ministral.md): [Ministral](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410) is an 8B parameter language model that extend...
- [Ministral3](ministral3.md): A balanced model in the Ministral 3 family, Ministral 3 8B is a powerful, efficient tiny language model with vision c...
- [Mistral](mistral.md): [Mistral](https://huggingface.co/papers/2310.06825) is a 7B parameter language model, available as a pretrained and i...
- [Mistral 3](mistral3.md): [Mistral 3](https://mistral.ai/news/mistral-small-3) is a latency-optimized model with far fewer layers to reduce t...
- [Mixtral](mixtral.md): [Mixtral-8x7B](https://huggingface.co/papers/2401.04088) was introduced in the [Mixtral of Experts blogpost](https://...
- [MLCD](mlcd.md): The [MLCD](https://huggingface.co/papers/2407.17331) models were released by the DeepGlint-AI team in [unicom](https:...
- [Mllama](mllama.md): The [Llama 3.2-Vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) collection of mul...
- [mLUKE](mluke.md): The mLUKE model was proposed in [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Model...
- [MM Grounding DINO](mm-grounding-dino.md): The [MM Grounding DINO](https://huggingface.co/papers/2401.02361) model was proposed in [An Open and Comprehensive Pipeli...
- [MMS](mms.md): The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://huggingface.co/papers/2305.13516)
- [MobileBERT](mobilebert.md): [MobileBERT](https://huggingface.co/papers/2004.02984) is a lightweight and efficient variant of BERT, specifically d...
- [MobilenetV2 and above](mobilenet-v1.md): For MobilenetV2+ see this file [mobilenet/README.md](https://github.com/tensorflow/models/blob/master/research/slim/n...
- [MobileNet V2](mobilenet-v2.md): [MobileNet V2](https://huggingface.co/papers/1801.04381) improves performance on mobile devices with a more efficient...
- [MobileViT](mobilevit.md): [MobileViT](https://huggingface.co/papers/2110.02178) is a lightweight vision transformer for mobile devices that mer...
- [MobileViTV2](mobilevitv2.md): The MobileViTV2 model was proposed in [Separable Self-attention for Mobile Vision Transformers](https://huggingface.c...
- [Models](model.md): The base class [PreTrainedModel](/docs/transformers/v5.0.0rc1/en/main_classes/model#transformers.PreTrainedModel) imp...
- [Model debugging toolboxes](model-debugging-utils.md): This page lists all the debugging and model adding tools used by the library, as well as the utility functions it
- [Model training anatomy](model-memory-anatomy.md): To understand performance optimization techniques that one can apply to improve efficiency of model training
- [Sharing](model-sharing.md): The Hugging Face [Hub](https://hf.co/models) is a platform for sharing, discovering, and consuming models of all diff...
- [Custom Layers and Utilities](modeling-utils.md): This page lists all the custom layers used by the library, as well as the utility functions and classes it provides f...
- [Loading models](models.md): Transformers provides many pretrained models that are ready to use with a single line of code. It requires a model cl...
- [Models Timeline](models-timeline.md): The [Models Timeline](https://huggingface.co/spaces/yonigozlan/Transformers-Timeline) is an interactive chart of how ...
- [ModernBERT Decoder](modernbert-decoder.md): ModernBERT Decoder has the same architecture as [ModernBERT](https://huggingface.co/papers/2412.13663) but it is trai...
- [ModernBERT](modernbert.md): [ModernBERT](https://huggingface.co/papers/2412.13663) is a modernized version of `BERT` trained on 2T tokens. It bri...
- [Contributing a new model to Transformers](modular-transformers.md): Modular Transformers lowers the bar for contributing models and significantly reduces the code required to add a mode...
- [Monocular depth estimation](monocular-depth-estimation.md): Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
- [Moonshine](moonshine.md): [Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for re...
- [Moshi](moshi.md): The Moshi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://huggingface.co...
- [MPNet](mpnet.md): The MPNet model was proposed in [MPNet: Masked and Permuted Pre-training for Language Understanding](https://huggingf...
- [MPT](mpt.md): The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and fin...
- [MRA](mra.md): The MRA model was proposed in [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://huggingface.co...
- [mT5](mt5.md): [mT5](https://huggingface.co/papers/2010.11934) is a multilingual variant of [T5](./t5), trained on 101 languages. I...
- [Multiple choice](multiple-choice.md): A multiple choice task is similar to question answering, except several candidate answers are provided along with a c...
- [MusicGen](musicgen.md): The MusicGen model was proposed in the paper [Simple and Controllable Music Generation](https://huggingface.co/papers...
- [MusicGen Melody](musicgen-melody.md): The MusicGen Melody model was proposed in [Simple and Controllable Music Generation](https://huggingface.co/papers/23...
- [MVP](mvp.md): The MVP model was proposed in [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://huggi...
- [MXFP4](mxfp4.md): Note: MXFP4 quantization currently only works for OpenAI GPT-OSS 120b and 20b.
- [myt5](myt5.md): The myt5 model was proposed in [MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Mod...
- [NanoChat](nanochat.md): [NanoChat](https://huggingface.co/karpathy/nanochat-d32) is a compact decoder-only transformer model designed for edu...
- [Nanotron](nanotron.md): [Nanotron](https://github.com/huggingface/nanotron) is a distributed training framework with tensor, pipeline, and da...
- [Nemotron](nemotron.md): Minitron is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/n...
- [NLLB-MOE](nllb-moe.md): The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://hugging...
- [NLLB](nllb.md): [NLLB: No Language Left Behind](https://huggingface.co/papers/2207.04672) is a multilingual translation model. It's t...
- [🤗 Transformers Notebooks](notebooks.md): You can find here a list of the official notebooks provided by Hugging Face.
- [Nougat](nougat.md): The Nougat model was proposed in [Nougat: Neural Optical Understanding for Academic Documents](https://huggingface.co...
- [Nyströmformer](nystromformer.md): The Nyströmformer model was proposed in [*Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention*]...
- [Object detection](object-detection.md): Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image....
- [OLMo](olmo.md): [OLMo](https://huggingface.co/papers/2402.00838) is a 7B-parameter dense language model. It uses SwiGLU activations, ...
- [OLMo2](olmo2.md): [OLMo2](https://huggingface.co/papers/2501.00656) improves on [OLMo](./olmo) by changing the architecture and trainin...
- [OLMo3](olmo3.md): Olmo3 is an improvement on [OLMo2](./olmo2). More details will be released *soon*.
- [OLMoE](olmoe.md): [OLMoE](https://huggingface.co/papers/2409.02060) is a sparse Mixture-of-Experts (MoE) language model with 7B paramet...
- [OmDet-Turbo](omdet-turbo.md): The OmDet-Turbo model was proposed in [Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion He...
- [OneFormer](oneformer.md): The OneFormer model was proposed in [OneFormer: One Transformer to Rule Universal Image Segmentation](https://hugging...
- [Audio transcriptions with WebUI and `transformers serve`](open-webui.md): This guide shows how to do audio transcription for chat purposes, using `transformers serve` and [Open WebUI](https:/...
- [GPT](openai-gpt.md): [GPT (Generative Pre-trained Transformer)](https://cdn.openai.com/research-covers/language-unsupervised/language_unde...
- [OPT](opt.md): [OPT](https://huggingface.co/papers/2205.01068) is a suite of open-source decoder-only pre-trained transformers whose...
- [Overview](optimization-overview.md): Transformers provides multiple inference optimization techniques to make models fast, affordable, and accessible. Opt...
- [Optimization](optimizer-schedules.md): The `.optimization` module provides:
- [Optimizers](optimizers.md): Transformers offers two native optimizers, AdamW and AdaFactor. It also provides integrations for more specialized op...
- [Optimum](optimum.md): [Optimum](https://huggingface.co/docs/optimum/index) is an optimization library that supports quantization for Intel,...
- [Model outputs](output.md): All models have outputs that are instances of subclasses of [ModelOutput](/docs/transformers/v5.0.0rc1/en/main_classe...
- [Overview](overview.md): Quantization lowers the memory requirements of loading and using a model by storing the weights in a lower precision ...
- [Ovis2](ovis2.md): [Ovis2](https://github.com/AIDC-AI/Ovis) is an updated version of the [Ovis](https://huggingface.co/papers/2405.2...
- [OWLv2](owlv2.md): OWLv2 was proposed in [Scaling Open-Vocabulary Object Detection](https://huggingface.co/papers/2306.09683) by Matthia...
- [OWL-ViT](owlvit.md): The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in [Simple Open-Vocabulary Object...
- [Padding and truncation](pad-truncation.md): Batched inputs are often different lengths, so they can't be converted to fixed-size tensors. Padding and truncation ...
- [PaddleOCR-VL](paddleocr-vl.md): **Huggingface Hub**: [PaddleOCR-VL](https://huggingface.co/collections/PaddlePaddle/paddleocr-vl) | **Github Repo**: ...
- [PaliGemma](paligemma.md): [PaliGemma](https://huggingface.co/papers/2407.07726) is a family of vision-language models (VLMs), combining [SigLIP...
- [Parakeet](parakeet.md): Parakeet models, [introduced by NVIDIA NeMo](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recog...
- [PatchTSMixer](patchtsmixer.md): The PatchTSMixer model was proposed in [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting...
- [PatchTST](patchtst.md): The PatchTST model was proposed in [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https:/...
- [PE Audio (Perception Encoder Audio)](pe-audio.md): PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (...
- [PE Audio Video (Perception Encoder Audio-Video)](pe-audio-video.md): TODO
- [PEFT](peft.md): [PEFT](https://huggingface.co/docs/peft/index), a library of parameter-efficient fine-tuning methods, enables trainin...
- [Pegasus](pegasus.md): [Pegasus](https://huggingface.co/papers/1912.08777) is an encoder-decoder (sequence-to-sequence) transformer model pr...
- [PEGASUS-X](pegasus-x.md): [PEGASUS-X](https://huggingface.co/papers/2208.04347) is an encoder-decoder (sequence-to-sequence) transformer model ...
- [Perceiver](perceiver.md): The Perceiver IO model was proposed in [Perceiver IO: A General Architecture for Structured Inputs &
- [PerceptionLM](perception-lm.md): The [PerceptionLM](https://huggingface.co/papers/2504.13180) model was proposed in [PerceptionLM: Open-Access Data an...
- [Build your own machine](perf-hardware.md): One of the most important considerations when building a machine for deep learning is the GPU choice. GPUs are the st...
- [CPU](perf-infer-cpu.md): CPUs are a viable and cost-effective inference option. With a few optimization methods, it is possible to achieve goo...
- [Distributed inference](perf-infer-gpu-multi.md): When a model doesn't fit on a single GPU, distributed inference with [tensor parallelism](./perf_train_gpu_many#tenso...
- [GPU](perf-infer-gpu-one.md): GPUs are the standard hardware for machine learning because they're optimized for memory bandwidth and parallelism. W...
- [torch.compile](perf-torch-compile.md): [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) compiles PyTorch code into op...
- [CPU](perf-train-cpu.md): A modern CPU is capable of efficiently training large models by leveraging the underlying optimizations built into th...
- [Distributed CPUs](perf-train-cpu-many.md): CPUs are commonly available and can be a cost-effective training option when GPUs are unavailable. When training larg...
- [Intel Gaudi](perf-train-gaudi.md): The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai/products/gaudi/), [Intel Gaudi 2](ht...
- [Parallelism methods](perf-train-gpu-many.md): Multi-GPU setups are effective for accelerating training and fitting large models in memory that otherwise wouldn't f...
- [GPU](perf-train-gpu-one.md): GPUs are commonly used to train deep learning models due to their high memory bandwidth and parallel processing capab...
- [Apple Silicon](perf-train-special.md): Apple Silicon (M series) features a unified memory architecture, making it possible to efficiently train large models...
- [Perplexity of fixed-length models](perplexity.md): Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
- [Persimmon](persimmon.md): The Persimmon model was created by [ADEPT](https://www.adept.ai/blog/persimmon-8b), and authored by Erich Elsen, Augu...
- [Phi](phi.md): [Phi](https://huggingface.co/papers/2306.11644) is a 1.3B parameter transformer model optimized for Python code gener...
- [Phi-3](phi3.md): The Phi-3 model was proposed in [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https...
- [Phi4 Multimodal](phi4-multimodal.md): [Phi4 Multimodal](https://huggingface.co/papers/2503.01743) is a multimodal model capable of text, image, and speech ...
- [Philosophy](philosophy.md): 🤗 Transformers is an opinionated library built for:
- [PhiMoE](phimoe.md): The PhiMoE model was proposed in [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](http...
- [PhoBERT](phobert.md): The PhoBERT model was proposed in [PhoBERT: Pre-trained language models for Vietnamese](https://huggingface.co/papers...
- [Machine learning apps](pipeline-gradio.md): [Gradio](https://www.gradio.app/), a fast and easy library for building and sharing machine learning apps, is integra...
- [Pipeline](pipeline-tutorial.md): The [Pipeline](/docs/transformers/v4.57.3/en/main_classes/pipelines#transformers.Pipeline) is a simple but powerful i...
- [Web server inference](pipeline-webserver.md): A web server is a system that waits for requests and serves them as they come in. This means you can use [Pipeline](/...
- [Pipelines](pipelines.md): The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
- [Utilities for pipelines](pipelines-utils.md): This page lists all the utility functions the library provides for pipelines.
- [Pix2Struct](pix2struct.md): The Pix2Struct model was proposed in [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding...
- [Pixio](pixio.md): Pixio is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks li...
- [Pixtral](pixtral.md): [Pixtral](https://huggingface.co/papers/2410.07073) is a multimodal model trained to understand natural images and do...
- [PLBart](plbart.md): The PLBART model was proposed in [Unified Pre-training for Program Understanding and Generation](https://huggingface....
- [PoolFormer](poolformer.md): The PoolFormer model was proposed in [MetaFormer is Actually What You Need for Vision](https://huggingface.co/papers/...
- [Pop2Piano](pop2piano.md): The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://huggingface.co/paper...
- [Checks on a Pull Request](pr-checks.md): When you open a pull request on 🤗 Transformers, a fair number of checks will be run to make sure the patch you are ad...
- [Processors](processors.md): Processors can mean two different things in the Transformers library:
- [Prompt Depth Anything](prompt-depth-anything.md): The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth E...
- [Prompt engineering](prompting.md): Prompt engineering, or prompting, uses natural language to improve large language model (LLM) performance on a variety...
- [ProphetNet](prophetnet.md): The ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](ht...
- [Pyramid Vision Transformer (PVT)](pvt.md): The PVT model was proposed in
- [Pyramid Vision Transformer V2 (PVTv2)](pvt-v2.md): The PVTv2 model was proposed in
- [Quantization](quantization.md): Quantization techniques reduce memory and computational costs by representing weights and activations with lower-prec...
- [Optimum Quanto](quanto.md): [Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggi...
- [Quark](quark.md): [Quark](https://quark.docs.amd.com/latest/) is a deep learning quantization toolkit designed to be agnostic to specif...
- [Question answering](question-answering.md): Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri...
- [Quickstart](quicktour.md): Transformers is designed to be fast and easy to use so that everyone can start learning or building with transformer ...
- [Qwen2](qwen2.md): [Qwen2](https://huggingface.co/papers/2407.10671) is a family of large language models (pretrained, instruction-tuned...
- [Qwen2.5-Omni](qwen2-5-omni.md): The [Qwen2.5-Omni](https://qwenlm.github.io/blog/qwen2.5-omni/) model is a unified multimodal model proposed...
- [Qwen2.5-VL](qwen2-5-vl.md): [Qwen2.5-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model, available in 3B, 7B, an...
- [Qwen2Audio](qwen2-audio.md): Qwen2-Audio is a new series of large audio-language models from the Qwen team. Qwen2-Audio is capable of ...
- [Qwen2MoE](qwen2-moe.md): [Qwen2MoE](https://huggingface.co/papers/2407.10671) is a Mixture-of-Experts (MoE) variant of [Qwen2](./qwen2), avail...
- [Qwen2-VL](qwen2-vl.md): The [Qwen2-VL](https://huggingface.co/papers/2409.12191) ([blog post](https://qwenlm.github.io/blog/qwen2-vl/)) model...
- [Qwen3](qwen3.md): [Qwen3](https://huggingface.co/papers/2505.09388) refers to the dense model architecture Qwen3-32B, which was released...
- [Qwen3MoE](qwen3-moe.md): [Qwen3MoE](https://huggingface.co/papers/2505.09388) refers to the mixture of experts model architecture Qwen3-235B-A...
- [Qwen3-Next](qwen3-next.md): The Qwen3-Next series represents Qwen's next-generation foundation models, optimized for extreme context length and larg...
- [Qwen3-Omni-MOE](qwen3-omni-moe.md): The Qwen3-Omni-MOE model is a unified multimodal model proposed in [Qwen3-Omni Technical Report](https://hug...
- [Qwen3-VL](qwen3-vl.md): [Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both ...
- [Qwen3-VL-Moe](qwen3-vl-moe.md): [Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both ...
- [RAG](rag.md): [Retrieval-Augmented Generation (RAG)](https://huggingface.co/papers/2005.11401) combines a pretrained language model...
- [RecurrentGemma](recurrent-gemma.md): The Recurrent Gemma model was proposed in [RecurrentGemma: Moving Past Transformers for Efficient Open Language Model...
- [Reformer](reformer.md): The Reformer model was proposed in the paper [Reformer: The Efficient Transformer](https://huggingface.co/papers/2001...
- [RegNet](regnet.md): The RegNet model was proposed in [Designing Network Design Spaces](https://huggingface.co/papers/2003.13678) by Ilija...
- [RemBERT](rembert.md): The RemBERT model was proposed in [Rethinking Embedding Coupling in Pre-trained Language Models](https://huggingface....
- [ResNet](resnet.md): The ResNet model was proposed in [Deep Residual Learning for Image Recognition](https://huggingface.co/papers/1512.03...
- [RoBERTa-PreLayerNorm](roberta-prelayernorm.md): The RoBERTa-PreLayerNorm model was proposed in [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://hu...
- [RoBERTa](roberta.md): [RoBERTa](https://huggingface.co/papers/1907.11692) improves BERT with new pretraining objectives, demonstrating [BER...
- [RoCBert](roc-bert.md): [RoCBert](https://aclanthology.org/2022.acl-long.65.pdf) is a pretrained Chinese [BERT](./bert) model designed agains...
- [RoFormer](roformer.md): [RoFormer](https://huggingface.co/papers/2104.09864) introduces Rotary Position Embedding (RoPE) to encode token posi...
- [Utilities for Rotary Embedding](rope-utils.md): This page explains how the Rotary Embedding is computed and applied in Transformers and what types of RoPE are suppor...
- [RT-DETR](rt-detr.md): The RT-DETR model was proposed in [DETRs Beat YOLOs on Real-time Object Detection](https://huggingface.co/papers/2304...
- [RT-DETRv2](rt-detr-v2.md): The RT-DETRv2 model was proposed in [RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transf...
- [Training scripts](run-scripts.md): Transformers provides many example training scripts for PyTorch and tasks in [transformers/examples](https://github.c...
- [RWKV](rwkv.md): The RWKV model (version 4) was proposed in [this repo](https://github.com/BlinkDL/RWKV-LM)
- [SAM](sam.md): SAM (Segment Anything Model) was proposed in [Segment Anything](https://huggingface.co/papers/2304.02643) by Alexande...
- [SAM2](sam2.md): SAM2 (Segment Anything Model 2) was proposed in [Segment Anything in Images and Videos](https://ai.meta.com/research/...
- [SAM2 Video](sam2-video.md): SAM2 (Segment Anything Model 2) was proposed in [Segment Anything in Images and Videos](https://ai.meta.com/research/...
- [SAM3](sam3.md): SAM3 (Segment Anything Model 3) was introduced in [SAM 3: Segment Anything with Concepts](https://ai.meta.com/researc...
- [SAM3 Tracker](sam3-tracker.md): SAM3 (Segment Anything Model 3) was introduced in [SAM 3: Segment Anything with Concepts](https://ai.meta.com/researc...
- [SAM3 Tracker Video](sam3-tracker-video.md): SAM3 (Segment Anything Model 3) was introduced in [SAM 3: Segment Anything with Concepts](https://ai.meta.com/researc...
- [SAM3 Video](sam3-video.md): SAM3 (Segment Anything Model 3) was introduced in [SAM 3: Segment Anything with Concepts](https://ai.meta.com/researc...
- [SAM-HQ](sam-hq.md): SAM-HQ (High-Quality Segment Anything Model) was proposed in [Segment Anything in High Quality](https://huggingface.c...
- [SeamlessM4T](seamless-m4t.md): The SeamlessM4T model was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https:/...
- [SeamlessM4T-v2](seamless-m4t-v2.md): The SeamlessM4T-v2 model was proposed in [Seamless: Multilingual Expressive and Streaming Speech Translation](https:/...
- [SeedOss](seed-oss.md): To be released with the official model launch.
- [SegFormer](segformer.md): [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://huggingface.co/papers/21...
- [SegGPT](seggpt.md): The SegGPT model was proposed in [SegGPT: Segmenting Everything In Context](https://huggingface.co/papers/2304.03284)...
- [Selecting a quantization method](selecting.md): There are many quantization methods available in Transformers for inference and fine-tuning. This guide helps you cho...
- [Image Segmentation](semantic-segmentation.md): Image segmentation models separate regions corresponding to different areas of interest in an image. These models work ...
- [Text classification](sequence-classification.md): Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run tex...
- [ONNX](serialization.md): [ONNX](http://onnx.ai) is an open standard that defines a common set of operators and a file format to represent deep...
- [Serving](serving.md): Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and oth...
- [SEW-D](sew-d.md): SEW-D (Squeezed and Efficient Wav2Vec with Disentangled attention) was proposed in [Performance-Efficiency Trade-offs
- [SEW](sew.md): SEW (Squeezed and Efficient Wav2Vec) was proposed in [Performance-Efficiency Trade-offs in Unsupervised Pre-training
- [SGLang](sglang.md): [SGLang](https://docs.sglang.ai) is a low-latency, high-throughput inference engine for large language models (LLMs)....
- [ShieldGemma 2](shieldgemma2.md): The ShieldGemma 2 model was proposed in a [technical report](https://huggingface.co/papers/2504.01081) by Google. Shi...
- [SigLIP](siglip.md): [SigLIP](https://huggingface.co/papers/2303.15343) is a multimodal image-text model similar to [CLIP](clip). It uses ...
- [SigLIP2](siglip2.md): [SigLIP2](https://huggingface.co/papers/2502.14786) is a family of multilingual vision-language encoders that builds ...
- [SmolLM3](smollm3.md): [SmolLM3](https://huggingface.co/blog/smollm3) is a fully open, compact language model designed for efficient deploym...
- [SmolVLM](smolvlm.md): [SmolVLM2](https://huggingface.co/papers/2504.05299) ([blog post](https://huggingface.co/blog/smolvlm2)) is an adapta...
- [SolarOpen](solar-open.md): The SolarOpen model was proposed in [Solar Open Technical Report](https://huggingface.co/papers/2601.07022) by Upstag...
- [Speech Encoder Decoder Models](speech-encoder-decoder.md): The [SpeechEncoderDecoderModel](/docs/transformers/v5.0.0/en/model_doc/speech-encoder-decoder#transformers.SpeechEnco...
- [Speech2Text](speech-to-text.md): The Speech2Text model was proposed in [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://huggingface.co...
- [SpeechT5](speecht5.md): The SpeechT5 model was proposed in [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processi...
- [Splinter](splinter.md): The Splinter model was proposed in [Few-Shot Question Answering by Pretraining Span Selection](https://huggingface.co...
- [SpQR](spqr.md): The [SpQR](https://hf.co/papers/2306.03078) quantization algorithm involves a 16x16 tiled bi-level group 3-bit quanti...
- [SqueezeBERT](squeezebert.md): The SqueezeBERT model was proposed in [SqueezeBERT: What can computer vision teach NLP about efficient neural network...
- [StableLM](stablelm.md): StableLM 3B 4E1T ([blog post](https://stability.ai/news/stable-lm-3b-sustainable-high-performance-language-models-sma...
- [Starcoder2](starcoder2.md): StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flags...
- [Summarization](summarization.md): Summarization creates a shorter version of a document or an article that captures all the important information. Alon...
- [SuperGlue](superglue.md): [SuperGlue](https://huggingface.co/papers/1911.11763) is a neural network that matches two sets of local features by ...
- [SuperPoint](superpoint.md): [SuperPoint](https://huggingface.co/papers/1712.07629) is the result of self-supervised training of a fully-convoluti...
- [SwiftFormer](swiftformer.md): The SwiftFormer model was proposed in [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobi...
- [Swin Transformer](swin.md): [Swin Transformer](https://huggingface.co/papers/2103.14030) is a hierarchical vision transformer. Images are process...
- [Swin2SR](swin2sr.md): The Swin2SR model was proposed in [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration]...
- [Swin Transformer V2](swinv2.md): [Swin Transformer V2](https://huggingface.co/papers/2111.09883) is a 3B parameter model that focuses on how to scale ...
- [Switch Transformers](switch-transformers.md): [Switch Transformers](https://huggingface.co/papers/2101.03961) is a sparse T5 model where the MLP layer is replaced ...
- [T5](t5.md): [T5](https://huggingface.co/papers/1910.10683) is an encoder-decoder transformer available in a range of sizes from 60...
- [T5Gemma](t5gemma.md): T5Gemma (aka encoder-decoder Gemma) was proposed in a [research paper](https://huggingface.co/papers/2504.06225) by G...
- [T5Gemma 2](t5gemma2.md): T5Gemma 2 is a family of pretrained encoder-decoder large language models with strong multilingual, multimodal and lo...
- [T5v1.1](t5v11.md): T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/tex...
- [Table Transformer](table-transformer.md): The Table Transformer model was proposed in [PubTables-1M: Towards comprehensive table extraction from unstructured d...
- [TAPAS](tapas.md): The TAPAS model was proposed in [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://huggingface.co/pape...
- [TensorRT-LLM](tensorrt-llm.md): [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) optimizes LLM inference on NVIDIA GPUs. It compiles models ...
- [Testing](testing.md): Let's take a look at how 🤗 Transformers models are tested and how you can write new tests and improve the existing ones.
- [Text to speech](text-to-speech.md): Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in ...
- [Generation](text-generation.md): Each framework has a generate method for text generation implemented in its respective `GenerationMixin` class:
- [TextNet](textnet.md): The TextNet model was proposed in [FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representatio...
- [LiteRT](tflite.md): [LiteRT](https://ai.google.dev/edge/litert) (previously known as TensorFlow Lite) is a high-performance runtime desig...
- [Time Series Transformer](time-series-transformer.md): The Time Series Transformer model is a vanilla encoder-decoder Transformer for time series forecasting.
- [Time Series Utilities](time-series-utils.md): This page lists all the utility functions and classes that can be used for Time Series based models.
- [TimesFM](timesfm.md): TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in [A decoder-only found...
- [TimeSformer](timesformer.md): The TimeSformer model was proposed in [TimeSformer: Is Space-Time Attention All You Need for Video Understanding?](ht...
- [TimmWrapper](timm-wrapper.md): A helper class that enables loading timm models for use with the transformers library and its autoclasses.
- [Tiny_Agents](tiny-agents.md): To showcase the use of MCP tools, let's see how to integrate the `transformers serve` server with the [`tiny-agents`]...
- [Token classification](token-classification.md): Token classification assigns a label to individual tokens in a sentence. One of the most common token classification ...
- [Utilities for Tokenizers](tokenization-utils.md): This page lists all the utility functions used by the tokenizers, mainly the class
- [Tokenizer](tokenizer.md): A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most
- [Summary of the tokenizers](tokenizer-summary.md): On this page, we will have a closer look at tokenization.
- [Tools](tools.md): (deprecated)
- [torchao](torchao.md): [torchao](https://github.com/pytorch/ao) is a PyTorch architecture optimization library with support for custom high ...
- [TorchScript](torchscript.md): [TorchScript](https://pytorch.org/docs/stable/jit.html) serializes PyTorch models into programs that can be executed ...
- [torchtitan](torchtitan.md): [torchtitan](https://github.com/pytorch/torchtitan) is PyTorch's distributed training framework for large language mo...
- [Trainer](trainer.md): [Trainer](/docs/transformers/v4.57.3/en/main_classes/trainer#transformers.Trainer) is a complete training and evaluat...
- [Utilities for Trainer](trainer-utils.md): This page lists all the utility functions used by [Trainer](/docs/transformers/v5.0.0/en/main_classes/trainer#transfo...
- [Fine-tuning](training.md): Fine-tuning adapts a pretrained model to a specific task with a smaller specialized dataset. This approach requires f...
- [Training Vision Models using Backbone API](training-vision-backbone.md): Computer vision workflows follow a common pattern. Use a pre-trained backbone for feature extraction ([ViT](../model_...
- [Building a compatible model backend for inference](transformers-as-backend.md): Transformers models are compatible with inference engines like [vLLM](https://github.com/vllm-project/vllm) and [SGLa...
- [Translation](translation.md): Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as...
- [TRL](trl.md): [TRL](https://huggingface.co/docs/trl/index) is a post-training framework for foundation models. It includes methods ...
- [TrOCR](trocr.md): [TrOCR](https://huggingface.co/papers/2109.10282) is a text recognition model for both image understanding and text g...
- [Troubleshoot](troubleshooting.md): Sometimes errors occur, but we are here to help! This guide covers some of the most common issues we've seen and how ...
- [TVP](tvp.md): The text-visual prompting (TVP) framework was proposed in the paper [Text-Visual Prompting for Efficient 2D Temporal ...
- [UDOP](udop.md): The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://huggingf...
- [UL2](ul2.md): The UL2 model was presented in [Unifying Language Learning Paradigms](https://huggingface.co/papers/2205.05131) by Yi ...
- [UMT5](umt5.md): The UMT5 model was proposed in [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pret...
- [UniSpeech-SAT](unispeech-sat.md): The UniSpeech-SAT model was proposed in [UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware
- [UniSpeech](unispeech.md): The UniSpeech model was proposed in [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Dat...
- [UnivNet](univnet.md): The UnivNet model was proposed in [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for Hig...
- [Unsloth](unsloth.md): [Unsloth](https://unsloth.ai/docs) is a fine-tuning and reinforcement learning framework that speeds up training and reduces m...
- [UPerNet](upernet.md): The UPerNet model was proposed in [Unified Perceptual Parsing for Scene Understanding](https://huggingface.co/papers/...
- [VaultGemma](vaultgemma.md): [VaultGemma](https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf) is a text-only decoder model
- [Video classification](video-classification.md): Video classification is the task of assigning a label or class to an entire video. Videos are expected to have only o...
- [VideoLLaMA3](video-llama-3.md): The [VideoLLaMA3](https://huggingface.co/papers/2501.13106) model is a major update to [VideoLLaMA2](https://huggingf...
- [Video-LLaVA](video-llava.md): Video-LLaVA is an open-source multimodal LLM trained by fine-tuning LLaMA/Vicuna on multimodal instruction-following ...
- [Video Processor](video-processor.md): A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the...
- [Video Processor](video-processors.md): A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the...
- [Video-text-to-text](video-text-to-text.md): Video-text-to-text models, also known as video language models, can process video and output text. These mode...
- [VideoMAE](videomae.md): The VideoMAE model was proposed in [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Vid...
- [ViLT](vilt.md): The ViLT model was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](http...
- [VipLlava](vipllava.md): The VipLlava model was proposed in [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://huggi...
- [Vision Encoder Decoder Models](vision-encoder-decoder.md): The [VisionEncoderDecoderModel](/docs/transformers/v5.0.0/en/model_doc/vision-encoder-decoder#transformers.VisionEnco...
- [VisionTextDualEncoder](vision-text-dual-encoder.md): The [VisionTextDualEncoderModel](/docs/transformers/v5.0.0/en/model_doc/vision-text-dual-encoder#transformers.VisionT...
- [VisualBERT](visual-bert.md): [VisualBERT](https://huggingface.co/papers/1908.03557) is a vision-and-language model. It uses an approach called "ea...
- [Visual document retrieval](visual-document-retrieval.md): Documents can contain multimodal data if they include charts, tables, and visuals in addition to text. Retrieving inf...
- [Visual Question Answering](visual-question-answering.md): Visual Question Answering (VQA) is the task of answering open-ended questions based on an image.
- [Vision Transformer (ViT)](vit.md): [Vision Transformer (ViT)](https://huggingface.co/papers/2010.11929) is a transformer adapted for computer vision tas...
- [ViTMAE](vit-mae.md): [ViTMAE](https://huggingface.co/papers/2111.06377) is a self-supervised vision model that is pretrained by masking la...
- [ViTMSN](vit-msn.md): The ViTMSN model was proposed in [Masked Siamese Networks for Label-Efficient Learning](https://huggingface.co/papers...
- [ViTDet](vitdet.md): The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://huggingf...
- [ViTMatte](vitmatte.md): The ViTMatte model was proposed in [Boosting Image Matting with Pretrained Plain Vision Transformers](https://hugging...
- [ViTPose](vitpose.md): [ViTPose](https://huggingface.co/papers/2204.12484) is a vision transformer-based model for keypoint (pose) estimatio...
- [VITS](vits.md): [VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)](https://huggingface.co/papers/...
- [Video Vision Transformer (ViViT)](vivit.md): The ViViT model was proposed in [ViViT: A Video Vision Transformer](https://huggingface.co/papers/2103.15691) by Anur...
- [V-JEPA 2](vjepa2.md): [V-JEPA 2](https://huggingface.co/papers/2506.09985) ([blog post](https://ai.meta.com/blog/v-jepa-2-world-model-bench...
- [vLLM](vllm.md): [vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for serving LLMs at scale. It cont...
- [Voxtral](voxtral.md): Voxtral is an upgrade of [Ministral 3B and Mistral Small 3B](https://mistral.ai/news/ministraux), extending its langu...
- [VPTQ](vptq.md): [Vector Post-Training Quantization (VPTQ)](https://github.com/microsoft/VPTQ) is a Post-Training Quantization (PTQ) m...
- [Wav2Vec2-BERT](wav2vec2-bert.md): The [Wav2Vec2-BERT](https://huggingface.co/papers/2312.05187) model was proposed in [Seamless: Multilingual Expressiv...
- [Wav2Vec2-Conformer](wav2vec2-conformer.md): The Wav2Vec2-Conformer was added to an updated version of [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](ht...
- [Wav2Vec2](wav2vec2.md): The Wav2Vec2 model was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](...
- [Wav2Vec2Phoneme](wav2vec2-phoneme.md): The Wav2Vec2Phoneme model was proposed in [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al.,
- [WavLM](wavlm.md): The WavLM model was proposed in [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](ht...
- [Whisper](whisper.md): [Whisper](https://huggingface.co/papers/2212.04356) is an encoder-decoder (sequence-to-sequence) transformer pretraine...
- [X-CLIP](xclip.md): The X-CLIP model was proposed in [Expanding Language-Image Pretrained Models for General Video Recognition](https://h...
- [X-Codec](xcodec.md): The X-Codec model was proposed in [Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language ...
- [XGLM](xglm.md): The XGLM model was proposed in [Few-shot Learning with Multilingual Language Models](https://huggingface.co/papers/21...
- [XLM-RoBERTa-XL](xlm-roberta-xl.md): [XLM-RoBERTa-XL](https://huggingface.co/papers/2105.00572) is a 3.5B parameter multilingual masked language model pre...
- [XLM-RoBERTa](xlm-roberta.md): [XLM-RoBERTa](https://huggingface.co/papers/1911.02116) is a large multilingual masked language model trained on 2.5T...
- [XLM-V](xlm-v.md): XLM-V is a multilingual language model with a one million token vocabulary trained on 2.5TB of data from Common Crawl (...
- [XLM](xlm.md): [XLM](https://huggingface.co/papers/1901.07291) demonstrates cross-lingual pretraining with two approaches, unsupervi...
- [XLNet](xlnet.md): The XLNet model was proposed in [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://hu...
- [XLS-R](xls-r.md): The XLS-R model was proposed in [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https:...
- [XLSR-Wav2Vec2](xlsr-wav2vec2.md): The XLSR-Wav2Vec2 model was proposed in [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](h...
- [xLSTM](xlstm.md): The xLSTM model was proposed in [xLSTM: Extended Long Short-Term Memory](https://huggingface.co/papers/2405.04517) by...
- [X-MOD](xmod.md): The X-MOD model was proposed in [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](https://h...
- [YOLOS](yolos.md): [YOLOS](https://huggingface.co/papers/2106.00666) uses a [Vision Transformer (ViT)](./vit) for object detection with ...
- [YOSO](yoso.md): The YOSO model was proposed in [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](htt...
- [Zamba](zamba.md): [Zamba](https://huggingface.co/papers/2405.16712) ([blog post](https://www.zyphra.com/post/zamba)) is a large languag...
- [Zamba2](zamba2.md): [Zamba2](https://huggingface.co/papers/2411.15242) is a large language model (LLM) trained by Zyphra, and made availa...
- [Zero-shot image classification](zero-shot-image-classification.md): Zero-shot image classification is a task that involves classifying images into different categories using a model tha...
- [Zero-shot object detection](zero-shot-object-detection.md): Traditionally, models used for [object detection](object_detection) require labeled image datasets for training,
- [ZoeDepth](zoedepth.md): [ZoeDepth](https://huggingface.co/papers/2302.12288) is a depth estimation model that combines the generalization per...