# vLLM

> This index lists the vLLM documentation pages available in this section.

## Pages

- [Benchmark CLI](benchmarking-cli.md): This section guides you through running benchmark tests with the extensive datasets supported on vLLM.
- [Performance Dashboard](benchmarking-dashboard.md): The performance dashboard is used to confirm whether new changes improve or degrade performance under various workloads.
- [Parameter Sweeps](benchmarking-sweeps.md): `vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple...
- [vllm bench latency](cli-bench-latency.md)
- [vllm bench serve](cli-bench-serve.md)
- [vllm bench sweep plot](cli-bench-sweep-plot.md)
- [vllm bench sweep plot_pareto](cli-bench-sweep-plot-pareto.md)
- [vllm bench sweep serve](cli-bench-sweep-serve.md)
- [vllm bench sweep serve_sla](cli-bench-sweep-serve-sla.md)
- [vllm bench throughput](cli-bench-throughput.md)
- [vllm chat](cli-chat.md)
- [vllm complete](cli-complete.md)
- [Json_Tip.Inc](cli-json-tipinc.md): When passing JSON CLI arguments, the following sets of arguments are equivalent:
- [vllm run-batch](cli-run-batch.md)
- [vllm serve](cli-serve.md)
- [Meetups](community-meetups.md): We host regular meetups around the world. We will share project updates from the vLLM team and have guest speaker...
- [Sponsors](community-sponsors.md): vLLM is a community project. Our compute resources for development and testing are supported by the following organiz...
- [Conserving Memory](configuration-conserving-memory.md): Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this prob...
- [Engine Arguments](configuration-engine-args.md): Engine arguments control the behavior of the vLLM engine.
- [Environment Variables](configuration-env-vars.md): vLLM uses the following environment variables to configure the system:
- [Model Resolution](configuration-model-resolution.md): vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model reposi...
- [Optimization and Tuning](configuration-optimization.md): This guide covers optimization strategies and performance tuning for vLLM V1.
- [Server Arguments](configuration-serve-args.md): The `vllm serve` command is used to launch the OpenAI-compatible server.
- [CI Failures](contributing-ci-failures.md): What should I do when a CI job fails on my PR, but I don't think my PR caused...
- [Nightly Builds of vLLM Wheels](contributing-ci-nightly-builds.md): vLLM maintains a per-commit wheel repository (commonly referred to as "nightly") that pro...
- [Update PyTorch version on vLLM OSS CI/CD](contributing-ci-update-pytorch-version.md): vLLM's current policy is to always use the latest PyTorch stable...
- [Deprecation Policy](contributing-deprecation-policy.md): This document outlines the official policy and process for deprecating features...
- [Dockerfile](contributing-dockerfile-dockerfile.md): We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible ...
- [Incremental Compilation Workflow](contributing-incremental-build.md): When working on vLLM's C++/CUDA kernels located in the `csrc/` directory, recompiling the entire project with `uv pip...
- [Basic Model](contributing-model-basic.md): This guide walks you through the steps to implement a basic vLLM model.
- [Multi-Modal Support](contributing-model-multimodal.md): This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../featu...
- [Registering a Model](contributing-model-registration.md): vLLM relies on a model registry to determine how to run each model.
- [Unit Testing](contributing-model-tests.md): This page explains how to write unit tests to verify the implementation of your model.
- [Speech-to-Text (Transcription/Translation) Support](contributing-model-transcription.md): This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and ...
- [Profiling vLLM](contributing-profiling.md)
- [Vulnerability Management](contributing-vulnerability-management.md): As mentioned in the security...
- [Using Docker](deployment-docker.md): vLLM offers an official Docker image for deployment.
- [Anyscale](deployment-frameworks-anyscale.md): [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
- [AnythingLLM](deployment-frameworks-anything-llm.md): [AnythingLLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any...
- [AutoGen](deployment-frameworks-autogen.md): [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act ...
- [BentoML](deployment-frameworks-bentoml.md): [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as t...
- [Cerebrium](deployment-frameworks-cerebrium.md): vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastruct...
- [Chatbox](deployment-frameworks-chatbox.md): [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, and Linux.
- [Dify](deployment-frameworks-dify.md): [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface co...
- [dstack](deployment-frameworks-dstack.md): vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running ...
- [Haystack](deployment-frameworks-haystack.md): [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applicatio...
- [Helm](deployment-frameworks-helm.md): A Helm chart to deploy vLLM for Kubernetes
- [Hugging Face Inference Endpoints](deployment-frameworks-hf-inference-endpoints.md): Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, either starting from the [Hugging Fa...
- [LiteLLM](deployment-frameworks-litellm.md): [LiteLLM](https://github.com/BerriAI/litellm) calls all LLM APIs using the OpenAI format [Bedrock, Huggingface, Vertex...
- [Lobe Chat](deployment-frameworks-lobe-chat.md): [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
- [LWS](deployment-frameworks-lws.md): LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
- [Modal](deployment-frameworks-modal.md): vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto...
- [Open WebUI](deployment-frameworks-open-webui.md): [Open WebUI](https://github.com/open-webui/open-webui) is an extensible, feature-rich...
- [Retrieval-Augmented Generation](deployment-frameworks-retrieval-augmented-generation.md): [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique t...
- [SkyPilot](deployment-frameworks-skypilot.md): vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.c...
- [Streamlit](deployment-frameworks-streamlit.md): [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in mi...
- [NVIDIA Triton](deployment-frameworks-triton.md): The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quick...
- [KAITO](deployment-integrations-kaito.md): [KAITO](https://kaito-project.github.io/kaito/docs/) is a Kubernetes operator that supports deploying and serving LLM...
- [KServe](deployment-integrations-kserve.md): vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed mo...
- [Kthena](deployment-integrations-kthena.md): [**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how ...
- [KubeAI](deployment-integrations-kubeai.md): [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI mo...
- [KubeRay](deployment-integrations-kuberay.md): [KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clust...
- [Llama Stack](deployment-integrations-llamastack.md): vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
- [llm-d](deployment-integrations-llm-d.md): vLLM can be deployed with [llm-d](https://github.com/llm-d/llm-d), a Kubernetes-native distributed inference serving ...
- [llmaz](deployment-integrations-llmaz.md): [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models...
- [Production stack](deployment-integrations-production-stack.md): Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you t...
- [Using Kubernetes](deployment-k8s.md): Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you t...
- [Using Nginx](deployment-nginx.md): This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between th...
- [Architecture Overview](design-arch-overview.md): This document provides an overview of the vLLM architecture.
- [CUDA Graphs](design-cuda-graphs.md): This write-up introduces the new CUDA Graphs modes in vLLM v1 beyond previous [torch.compile integration](torch_compi...
- [Dual Batch Overlap](design-dbo.md): The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with...
- [How to debug the vLLM-torch.compile integration](design-debug-vllm-compile.md): TL;DR:
- [Fused MoE Modular Kernel](design-fused-moe-modular-kernel.md): FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)
- [Integration with Hugging Face](design-huggingface-integration.md): This document describes how vLLM integrates with Hugging Face libraries. We will explain step by step what happens un...
- [Hybrid KV Cache Manager](design-hybrid-kv-cache-manager.md)
- [IO Processor Plugins](design-io-processor-plugins.md): IO Processor plugins are a feature that allows pre- and post-processing of the model input and output for pooling mod...
- [Logits Processors](design-logits-processors.md)
- [LoRA Resolver Plugins](design-lora-resolver-plugins.md): This directory contains vLLM's LoRA resolver plugins built on the `LoRAResolver` framework.
- [Metrics](design-metrics.md): vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
- [Multi-Modal Data Processing](design-mm-processing.md): To enable various optimizations in vLLM such as [chunked prefill](../configuration/optimization.md#chunked-prefill) a...
- [Fused MoE Kernel Features](design-moe-kernel-features.md): The purpose of this document is to provide an overview of the various MoE kernels (both modular and non-modular) so i...
- [Python Multiprocessing](design-multiprocessing.md): Please see the [Troubleshooting](../usage/troubleshooting.md#python-multiprocessing)
- [Optimization Levels](design-optimization-levels.md): vLLM now supports optimization levels (`-O0`, `-O1`, `-O2`, `-O3`). Optimization levels provide an intuitive mechanis...
- [P2P NCCL Connector](design-p2p-nccl-connector.md): An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
- [Paged Attention](design-paged-attention.md)
- [Plugin System](design-plugin-system.md): The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes ...
- [Automatic Prefix Caching](design-prefix-caching.md): Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The...
- [`torch.compile` integration](design-torch-compile.md): In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This docume...
- [Automatic Prefix Caching](features-automatic-prefix-caching.md): Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reu...
- [Batch Invariance](features-batch-invariance.md)
- [Custom Arguments](features-custom-arguments.md): You can use vLLM *custom arguments* to pass in arguments which are not part of the vLLM `SamplingParams` and REST API...
- [Custom Logits Processors](features-custom-logitsprocs.md)
- [Disaggregated Encoder](features-disagg-encoder.md): A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the ...
- [Disaggregated Prefilling (experimental)](features-disagg-prefill.md): This page introduces you to the disaggregated prefilling feature in vLLM.
- [Interleaved Thinking](features-interleaved-thinking.md): Interleaved thinking allows models to reason between tool calls, enabling more sophisticated decision-making after re...
- [LoRA Adapters](features-lora.md): This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
- [MooncakeConnector Usage Guide](features-mooncake-connector-usage.md): Mooncake aims to enhance the inference efficiency of large language models (LLMs), especially in slow object storage ...
- [Multimodal Inputs](features-multimodal-inputs.md): This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-mu...
- [NixlConnector Usage Guide](features-nixl-connector-usage.md): NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provi...
- [Prompt Embedding Inputs](features-prompt-embeds.md): This page teaches you how to pass prompt embedding inputs to vLLM.
- [AutoAWQ](features-quantization-auto-awq.md)
- [AutoRound](features-quantization-auto-round.md): [AutoRound](https://github.com/intel/auto-round) is Intel’s advanced quantization algorithm designed to produce highl...
- [BitBLAS](features-quantization-bitblas.md): vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Co...
- [BitsAndBytes](features-quantization-bnb.md): vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
- [FP8 W8A8](features-quantization-fp8.md): vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such ...
- [GGUF](features-quantization-gguf.md)
- [GPTQModel](features-quantization-gptqmodel.md): To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQM...
- [FP8 INC](features-quantization-inc.md): vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on I...
- [INT4 W4A16](features-quantization-int4.md): vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is p...
- [INT8 W8A8](features-quantization-int8.md): vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
- [NVIDIA Model Optimizer](features-quantization-modelopt.md): The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) is a library designed to optimize models for ...
- [Quantized KV Cache](features-quantization-quantized-kvcache.md): Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored i...
- [AMD Quark](features-quantization-quark.md): Quantization can effectively reduce memory and bandwidth usage, accelerate computation, and improve...
- [TorchAO](features-quantization-torchao.md): TorchAO is an architecture optimization library for PyTorch; it provides high-performance dtypes, optimization techni...
- [Reasoning Outputs](features-reasoning-outputs.md): vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which ar...
- [Sleep Mode](features-sleep-mode.md): vLLM's Sleep Mode allows you to temporarily release most GPU memory used by a model, including model weights and KV c...
- [Speculative Decoding](features-spec-decode.md)
- [Structured Outputs](features-structured-outputs.md): vLLM supports the generation of structured outputs using...
- [Tool Calling](features-tool-calling.md): vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`), and `none` o...
- [Apple Silicon CPU Installation](getting-started-installation-cpuappleinc.md): vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on ...
- [Arm CPU Installation](getting-started-installation-cpuarminc.md): vLLM offers basic model inferencing and serving on the Arm CPU platform, with support for NEON and the data types FP32, FP16, and BF16.
- [CPU](getting-started-installation-cpu.md): vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instru...
- [s390x CPU Installation](getting-started-installation-cpus390xinc.md): vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build from source to nati...
- [x86 CPU Installation](getting-started-installation-cpux86inc.md): vLLM supports basic model inferencing and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
- [Installation](getting-started-installation-devicetemplate.md)
- [CUDA GPU Installation](getting-started-installation-gpucudainc.md): vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
- [GPU](getting-started-installation-gpu.md): vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instru...
- [ROCm GPU Installation](getting-started-installation-gpurocminc.md): vLLM supports AMD GPUs with ROCm 6.3 or above and torch 2.8.0 or above.
- [XPU GPU Installation](getting-started-installation-gpuxpuinc.md): vLLM initially supports basic model inference and serving on the Intel GPU platform.
- [Python_Env_Setup.Inc](getting-started-installation-python-env-setupinc.md): It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manag...
- [Quickstart](getting-started-quickstart.md): This guide will help you quickly get started with vLLM to perform:
- [Collaboration Policy](governance-collaboration.md): This page outlines how vLLM collaborates with model providers, hardware vendors, and other stakeholders.
- [Committers](governance-committers.md): This document lists the current committers of the vLLM project and the core areas they maintain.
- [Governance Process](governance-process.md): vLLM's success comes from our strong open source community. We favor informal, meritocratic norms over formal policie...
- [Loading Model weights with fastsafetensors](models-extensions-fastsafetensor.md): Loading Model weights with fastsafetensors
- [Loading models with Run:ai Model Streamer](models-extensions-runai-model-streamer.md): Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
- [Loading models with CoreWeave's Tensorizer](models-extensions-tensorizer.md): vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-...
- [Generative Models](models-generative-models.md): vLLM provides first-class support for generative models, which covers most LLMs.
- [CPU - Intel® Xeon®](models-hardware-supported-models-cpu.md)
- [XPU - Intel® GPUs](models-hardware-supported-models-xpu.md)
- [Pooling Models](models-pooling-models.md): vLLM also supports pooling models, such as embedding, classification, and reward models.
- [Supported Models](models-supported-models.md): vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
- [Context Parallel Deployment](serving-context-parallel-deployment.md): Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite diff...
- [Data Parallel Deployment](serving-data-parallel-deployment.md): vLLM supports Data Parallel deployment, where model weights are replicated across separate instances/GPUs to process ...
- [Troubleshooting distributed deployments](serving-distributed-troubleshooting.md): For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
- [Expert Parallel Deployment](serving-expert-parallel-deployment.md): vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on sepa...
- [LangChain](serving-integrations-langchain.md): vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
- [LlamaIndex](serving-integrations-llamaindex.md): vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
- [Offline Inference](serving-offline-inference.md): Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class (see the offline-inference sketch after this list).
- [OpenAI-Compatible Server](serving-openai-compatible-server.md): vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-referenc... (see the server-client sketch after this list).
- [Parallelism and Scaling](serving-parallelism-scaling.md): To choose a distributed inference strategy for a single-model replica, use the following guidelines:
- [Reinforcement Learning from Human Feedback](training-rlhf.md): Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generate...
- [Transformers Reinforcement Learning](training-trl.md): [Transformers Reinforcement Learning](https://huggingface.co/docs/trl) (TRL) is a full stack library that provides a ...
- [Frequently Asked Questions](usage-faq.md): Q: How can I serve multiple models on a single port using the OpenAI API?
- [Production Metrics](usage-metrics.md): vLLM exposes a number of metrics that can be used to monitor the health of the...
- [Reproducibility](usage-reproducibility.md): vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. To achieve...
- [Security](usage-security.md): All communications between nodes in a multi-node vLLM deployment are **insecure by default** and must be protected by...
- [Troubleshooting](usage-troubleshooting.md): This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please...
- [Usage Stats Collection](usage-usage-stats.md): vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model...
- [vLLM V1](usage-v1-guide.md)
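For the Offline Inference page above, here is a minimal offline-inference sketch using the `LLM` class. It is illustrative only: the model name is a small placeholder and the sampling settings are arbitrary; see the linked page for the full API.

```python
# Minimal offline-inference sketch with vLLM's LLM class.
# The model name is an illustrative placeholder; any supported HF model works.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # loads the model (downloads it if needed)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:  # one RequestOutput per prompt
    print(output.prompt, output.outputs[0].text)
```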
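And for the OpenAI-Compatible Server page, a minimal server-client sketch using the official `openai` Python client, assuming the server was started separately (for example with `vllm serve <model>` on the default port 8000); the model name is again a placeholder.

```python
# Minimal client sketch for the OpenAI-compatible server, assuming it was started
# with `vllm serve Qwen/Qwen2.5-1.5B-Instruct` (placeholder model) on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```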