# vLLM

> This index lists the vLLM documentation pages available in this section.

## Pages

- [Benchmark CLI](benchmarking-cli.md): This section guides you through running benchmark tests with the extensive datasets supported on vLLM.
- [Performance Dashboard](benchmarking-dashboard.md): The performance dashboard is used to confirm whether new changes improve or degrade performance under various workloads.
- [Parameter Sweeps](benchmarking-sweeps.md): `vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple...
- [vllm bench latency](cli-bench-latency.md)
- [vllm bench serve](cli-bench-serve.md)
- [vllm bench sweep plot](cli-bench-sweep-plot.md)
- [vllm bench sweep plot_pareto](cli-bench-sweep-plot-pareto.md)
- [vllm bench sweep serve](cli-bench-sweep-serve.md)
- [vllm bench sweep serve_sla](cli-bench-sweep-serve-sla.md)
- [vllm bench throughput](cli-bench-throughput.md)
- [vllm chat](cli-chat.md)
- [vllm complete](cli-complete.md)
- [Json_Tip.Inc](cli-json-tipinc.md): When passing JSON CLI arguments, the following sets of arguments are equivalent:
- [vllm run-batch](cli-run-batch.md)
- [vllm serve](cli-serve.md)
- [Meetups](community-meetups.md): We host regular meetups around the world. We will share project updates from the vLLM team and have guest speaker...
- [Sponsors](community-sponsors.md): vLLM is a community project. Our compute resources for development and testing are supported by the following organiz...
- [Conserving Memory](configuration-conserving-memory.md): Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this prob...
- [Engine Arguments](configuration-engine-args.md): Engine arguments control the behavior of the vLLM engine.
- [Environment Variables](configuration-env-vars.md): vLLM uses the following environment variables to configure the system:
- [Model Resolution](configuration-model-resolution.md): vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model reposi...
- [Optimization and Tuning](configuration-optimization.md): This guide covers optimization strategies and performance tuning for vLLM V1.
- [Server Arguments](configuration-serve-args.md): The `vllm serve` command is used to launch the OpenAI-compatible server.
- [CI Failures](contributing-ci-failures.md): What should I do when a CI job fails on my PR, but I don't think my PR caused...
- [Nightly Builds of vLLM Wheels](contributing-ci-nightly-builds.md): vLLM maintains a per-commit wheel repository (commonly referred to as "nightly") that pro...
- [Update PyTorch version on vLLM OSS CI/CD](contributing-ci-update-pytorch-version.md): vLLM's current policy is to always use the latest PyTorch stable...
- [Deprecation Policy](contributing-deprecation-policy.md): This document outlines the official policy and process for deprecating features...
- [Dockerfile](contributing-dockerfile-dockerfile.md): We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible ...
- [Incremental Compilation Workflow](contributing-incremental-build.md): When working on vLLM's C++/CUDA kernels located in the `csrc/` directory, recompiling the entire project with `uv pip...
- [Basic Model](contributing-model-basic.md): This guide walks you through the steps to implement a basic vLLM model.
- [Multi-Modal Support](contributing-model-multimodal.md): This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../featu...
- [Registering a Model](contributing-model-registration.md): vLLM relies on a model registry to determine how to run each model.
- [Unit Testing](contributing-model-tests.md): This page explains how to write unit tests to verify the implementation of your model.
- [Speech-to-Text (Transcription/Translation) Support](contributing-model-transcription.md): This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and ...
- [Profiling vLLM](contributing-profiling.md)
- [Vulnerability Management](contributing-vulnerability-management.md): As mentioned in the security...
- [Using Docker](deployment-docker.md): vLLM offers an official Docker image for deployment.
- [Anyscale](deployment-frameworks-anyscale.md): [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
- [AnythingLLM](deployment-frameworks-anything-llm.md): [AnythingLLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any...
- [AutoGen](deployment-frameworks-autogen.md): [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act ...
- [BentoML](deployment-frameworks-bentoml.md): [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as t...
- [Cerebrium](deployment-frameworks-cerebrium.md): vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastruct...
- [Chatbox](deployment-frameworks-chatbox.md): [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, and Linux.
- [Dify](deployment-frameworks-dify.md): [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface co...
- [dstack](deployment-frameworks-dstack.md): vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running ...
- [Haystack](deployment-frameworks-haystack.md): [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applicatio...
- [Helm](deployment-frameworks-helm.md): A Helm chart to deploy vLLM for Kubernetes
- [Hugging Face Inference Endpoints](deployment-frameworks-hf-inference-endpoints.md): Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, either starting from the [Hugging Fa...
- [LiteLLM](deployment-frameworks-litellm.md): [LiteLLM](https://github.com/BerriAI/litellm) calls all LLM APIs using the OpenAI format [Bedrock, Huggingface, Vertex...
- [Lobe Chat](deployment-frameworks-lobe-chat.md): [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
- [LWS](deployment-frameworks-lws.md): LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
- [Modal](deployment-frameworks-modal.md): vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto...
- [Open WebUI](deployment-frameworks-open-webui.md): [Open WebUI](https://github.com/open-webui/open-webui) is an extensible, feature-rich...
- [Retrieval-Augmented Generation](deployment-frameworks-retrieval-augmented-generation.md): [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique t...
- [SkyPilot](deployment-frameworks-skypilot.md): vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.c...
- [Streamlit](deployment-frameworks-streamlit.md): [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in mi...
- [NVIDIA Triton](deployment-frameworks-triton.md): The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quick...
- [KAITO](deployment-integrations-kaito.md): [KAITO](https://kaito-project.github.io/kaito/docs/) is a Kubernetes operator that supports deploying and serving LLM...
- [KServe](deployment-integrations-kserve.md): vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed mo...
- [Kthena](deployment-integrations-kthena.md): [**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how ...
- [KubeAI](deployment-integrations-kubeai.md): [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI mo...
- [KubeRay](deployment-integrations-kuberay.md): [KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clust...
- [Llama Stack](deployment-integrations-llamastack.md): vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
- [llm-d](deployment-integrations-llm-d.md): vLLM can be deployed with [llm-d](https://github.com/llm-d/llm-d), a Kubernetes-native distributed inference serving ...
- [llmaz](deployment-integrations-llmaz.md): [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models...
- [Production stack](deployment-integrations-production-stack.md): Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you t...
- [Using Kubernetes](deployment-k8s.md): Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you t...
- [Using Nginx](deployment-nginx.md): This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between th...
- [Architecture Overview](design-arch-overview.md): This document provides an overview of the vLLM architecture.
- [CUDA Graphs](design-cuda-graphs.md): This write-up introduces the new CUDA Graphs modes in vLLM v1 beyond previous [torch.compile integration](torch_compi...
- [Dual Batch Overlap](design-dbo.md): The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with...
- [How to debug the vLLM-torch.compile integration](design-debug-vllm-compile.md): TL;DR:
- [Fused MoE Modular Kernel](design-fused-moe-modular-kernel.md): FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)
- [Integration with Hugging Face](design-huggingface-integration.md): This document describes how vLLM integrates with Hugging Face libraries. We will explain step by step what happens un...
- [Hybrid KV Cache Manager](design-hybrid-kv-cache-manager.md)
- [IO Processor Plugins](design-io-processor-plugins.md): IO Processor plugins are a feature that allows pre- and post-processing of the model input and output for pooling mod...
- [Logits Processors](design-logits-processors.md)
- [LoRA Resolver Plugins](design-lora-resolver-plugins.md): This directory contains vLLM's LoRA resolver plugins built on the `LoRAResolver` framework.
- [Metrics](design-metrics.md): vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
- [Multi-Modal Data Processing](design-mm-processing.md): To enable various optimizations in vLLM such as [chunked prefill](../configuration/optimization.md#chunked-prefill) a...
- [Fused MoE Kernel Features](design-moe-kernel-features.md): The purpose of this document is to provide an overview of the various MoE kernels (both modular and non-modular) so i...
- [Python Multiprocessing](design-multiprocessing.md): Please see the [Troubleshooting](../usage/troubleshooting.md#python-multiprocessing)
- [Optimization Levels](design-optimization-levels.md): vLLM now supports optimization levels (`-O0`, `-O1`, `-O2`, `-O3`). Optimization levels provide an intuitive mechanis...
- [P2P NCCL Connector](design-p2p-nccl-connector.md): An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
- [Paged Attention](design-paged-attention.md)
- [Plugin System](design-plugin-system.md): The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes ...
- [Automatic Prefix Caching](design-prefix-caching.md): Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The...
- [`torch.compile` integration](design-torch-compile.md): In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This docume...
- [Automatic Prefix Caching](features-automatic-prefix-caching.md): Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reu...
- [Batch Invariance](features-batch-invariance.md)
- [Custom Arguments](features-custom-arguments.md): You can use vLLM *custom arguments* to pass in arguments which are not part of the vLLM `SamplingParams` and REST API...
- [Custom Logits Processors](features-custom-logitsprocs.md)
- [Disaggregated Encoder](features-disagg-encoder.md): A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the ...
- [Disaggregated Prefilling (experimental)](features-disagg-prefill.md): This page introduces you to the disaggregated prefilling feature in vLLM.
- [Interleaved Thinking](features-interleaved-thinking.md): Interleaved thinking allows models to reason between tool calls, enabling more sophisticated decision-making after re...
- [LoRA Adapters](features-lora.md): This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
- [MooncakeConnector Usage Guide](features-mooncake-connector-usage.md): Mooncake aims to enhance the inference efficiency of large language models (LLMs), especially in slow object storage ...
- [Multimodal Inputs](features-multimodal-inputs.md): This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-mu...
- [NixlConnector Usage Guide](features-nixl-connector-usage.md): NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provi...
- [Prompt Embedding Inputs](features-prompt-embeds.md): This page teaches you how to pass prompt embedding inputs to vLLM.
- [AutoAWQ](features-quantization-auto-awq.md)
- [AutoRound](features-quantization-auto-round.md): [AutoRound](https://github.com/intel/auto-round) is Intel’s advanced quantization algorithm designed to produce highl...
- [BitBLAS](features-quantization-bitblas.md): vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Co...
- [BitsAndBytes](features-quantization-bnb.md): vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
- [FP8 W8A8](features-quantization-fp8.md): vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such ...
- [GGUF](features-quantization-gguf.md)
- [GPTQModel](features-quantization-gptqmodel.md): To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQM...
- [FP8 INC](features-quantization-inc.md): vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on I...
- [INT4 W4A16](features-quantization-int4.md): vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is p...
- [INT8 W8A8](features-quantization-int8.md): vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
- [NVIDIA Model Optimizer](features-quantization-modelopt.md): The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) is a library designed to optimize models for ...
- [Quantized KV Cache](features-quantization-quantized-kvcache.md): Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored i...
- [AMD Quark](features-quantization-quark.md): Quantization can effectively reduce memory and bandwidth usage, accelerate computation, and improve...
- [TorchAO](features-quantization-torchao.md): TorchAO is an architecture optimization library for PyTorch; it provides high-performance dtypes, optimization techni...
- [Reasoning Outputs](features-reasoning-outputs.md): vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which ar...
- [Sleep Mode](features-sleep-mode.md): vLLM's Sleep Mode allows you to temporarily release most GPU memory used by a model, including model weights and KV c...
- [Speculative Decoding](features-spec-decode.md)
- [Structured Outputs](features-structured-outputs.md): vLLM supports the generation of structured outputs using...
- [Tool Calling](features-tool-calling.md): vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`), and `none` o...
- [Apple Silicon CPU Installation](getting-started-installation-cpuappleinc.md): vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on ...
- [Arm CPU Installation](getting-started-installation-cpuarminc.md): vLLM offers basic model inferencing and serving on the Arm CPU platform, with support for NEON and the data types FP32, FP16, and BF16.
- [CPU](getting-started-installation-cpu.md): vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instru...
- [s390x CPU Installation](getting-started-installation-cpus390xinc.md): vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build from source to nati...
- [x86 CPU Installation](getting-started-installation-cpux86inc.md): vLLM supports basic model inferencing and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
- [Installation](getting-started-installation-devicetemplate.md)
- [CUDA GPU Installation](getting-started-installation-gpucudainc.md): vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
- [GPU](getting-started-installation-gpu.md): vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instru...
- [ROCm GPU Installation](getting-started-installation-gpurocminc.md): vLLM supports AMD GPUs with ROCm 6.3 or above and torch 2.8.0 or above.
- [XPU GPU Installation](getting-started-installation-gpuxpuinc.md): vLLM initially supports basic model inference and serving on the Intel GPU platform.
- [Python_Env_Setup.Inc](getting-started-installation-python-env-setupinc.md): It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manag...
- [Quickstart](getting-started-quickstart.md): This guide will help you quickly get started with vLLM to perform:
- [Collaboration Policy](governance-collaboration.md): This page outlines how vLLM collaborates with model providers, hardware vendors, and other stakeholders.
- [Committers](governance-committers.md): This document lists the current committers of the vLLM project and the core areas they maintain.
- [Governance Process](governance-process.md): vLLM's success comes from our strong open source community. We favor informal, meritocratic norms over formal policie...
- [Loading Model weights with fastsafetensors](models-extensions-fastsafetensor.md): Loading Model weights with fastsafetensors
- [Loading models with Run:ai Model Streamer](models-extensions-runai-model-streamer.md): Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
- [Loading models with CoreWeave's Tensorizer](models-extensions-tensorizer.md): vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-...
- [Generative Models](models-generative-models.md): vLLM provides first-class support for generative models, which covers most LLMs.
- [CPU - Intel® Xeon®](models-hardware-supported-models-cpu.md)
- [XPU - Intel® GPUs](models-hardware-supported-models-xpu.md)
- [Pooling Models](models-pooling-models.md): vLLM also supports pooling models, such as embedding, classification, and reward models.
- [Supported Models](models-supported-models.md): vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
- [Context Parallel Deployment](serving-context-parallel-deployment.md): Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite diff...
- [Data Parallel Deployment](serving-data-parallel-deployment.md): vLLM supports Data Parallel deployment, where model weights are replicated across separate instances/GPUs to process ...
- [Troubleshooting distributed deployments](serving-distributed-troubleshooting.md): For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
- [Expert Parallel Deployment](serving-expert-parallel-deployment.md): vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on sepa...
- [LangChain](serving-integrations-langchain.md): vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
- [LlamaIndex](serving-integrations-llamaindex.md): vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
- [Offline Inference](serving-offline-inference.md): Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class (see the offline-inference sketch after this list).
- [OpenAI-Compatible Server](serving-openai-compatible-server.md): vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-referenc... (see the server-client sketch after this list).
- [Parallelism and Scaling](serving-parallelism-scaling.md): To choose a distributed inference strategy for a single-model replica, use the following guidelines:
- [Reinforcement Learning from Human Feedback](training-rlhf.md): Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generate...
- [Transformers Reinforcement Learning](training-trl.md): [Transformers Reinforcement Learning](https://huggingface.co/docs/trl) (TRL) is a full stack library that provides a ...
- [Frequently Asked Questions](usage-faq.md): Q: How can I serve multiple models on a single port using the OpenAI API?
- [Production Metrics](usage-metrics.md): vLLM exposes a number of metrics that can be used to monitor the health of the...
- [Reproducibility](usage-reproducibility.md): vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. To achieve...
- [Security](usage-security.md): All communications between nodes in a multi-node vLLM deployment are **insecure by default** and must be protected by...
- [Troubleshooting](usage-troubleshooting.md): This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please...
- [Usage Stats Collection](usage-usage-stats.md): vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model...
- [vLLM V1](usage-v1-guide.md)
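For the Offline Inference page above, here is a minimal offline-inference sketch using the `LLM` class. It is illustrative only: the model name is a small placeholder and the sampling settings are arbitrary; see the linked page for the full API.

```python
# Minimal offline-inference sketch with vLLM's LLM class.
# The model name is an illustrative placeholder; any supported HF model works.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # loads the model (downloads it if needed)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:  # one RequestOutput per prompt
    print(output.prompt, output.outputs[0].text)
```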
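And for the OpenAI-Compatible Server page, a minimal server-client sketch using the official `openai` Python client, assuming the server was started separately (for example with `vllm serve <model>` on the default port 8000); the model name is again a placeholder.

```python
# Minimal client sketch for the OpenAI-compatible server, assuming it was started
# with `vllm serve Qwen/Qwen2.5-1.5B-Instruct` (placeholder model) on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```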