# vLLM
> This section guides you through running benchmark tests with the extensive datasets supported on vLLM.
---
# Benchmark CLI
This section guides you through running benchmark tests with the extensive datasets supported on vLLM.
It's a living document, updated as new features and datasets become available.
## Dataset Overview
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json` Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images: `wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
| RandomForReranking | ✅ | ✅ | `synthetic` |
| Prefix Repetition | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |
Legend:
- ✅ - supported
- 🟡 - partially supported
- 🚧 - to be supported
!!! note
For HuggingFace datasets, `--dataset-name` should be set to `hf`.
When using a local `--dataset-path`, set `--hf-name` to the dataset's Hugging Face ID, for example:
```bash
--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
```
## Examples
### 🚀 Online Benchmark
First start serving your model:
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
Then run the benchmarking script:
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
```
If successful, you will see the following output:
```text
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
```
#### Custom Dataset
If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data needs to be in `.jsonl` format, with a `"prompt"` field in each entry, e.g. `data.jsonl`:
```json
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
```bash
# start server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
```bash
# run benchmarking script
vllm bench serve --port 9001 --save-result --save-detailed \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/completions \
--dataset-name custom \
--dataset-path data.jsonl \
--custom-skip-chat-template \
--num-prompts 80 \
--max-concurrency 1 \
--temperature=0.3 \
--top-p=0.75 \
--result-dir "./log/"
```
You can skip applying the chat template if your data already includes it by using `--custom-skip-chat-template`.
#### VisionArena Benchmark for Vision Language Models
```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct
```
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--hf-split train \
--num-prompts 1000
```
#### InstructCoder Benchmark with Speculative Decoding
``` bash
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
```
``` bash
vllm bench serve \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name hf \
--dataset-path likaixin/InstructCoder \
--num-prompts 2048
```
#### Spec Bench Benchmark with Speculative Decoding
``` bash
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
```
This benchmark uses the [Spec-Bench dataset](https://github.com/hemingkx/Spec-Bench).
Run all categories:
``` bash
# Download the dataset using:
# wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl
vllm bench serve \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name spec_bench \
--dataset-path "/data/spec_bench/question.jsonl" \
--num-prompts -1
```
Available categories include `[writing, roleplay, reasoning, math, coding, extraction, stem, humanities, translation, summarization, qa, math_reasoning, rag]`.
Run only a specific category like "summarization":
``` bash
vllm bench serve \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name spec_bench \
--dataset-path "/data/spec_bench/question.jsonl" \
--num-prompts -1 \
--spec-bench-category "summarization"
```
#### Other HuggingFaceDataset Examples
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct
```
`lmms-lab/LLaVA-OneVision-Data`:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
`Aeala/ShareGPT_Vicuna_unfiltered`:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
`AI-MO/aimo-validation-aime`:
``` bash
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
--num-prompts 10 \
--seed 42
```
`philschmid/mt-bench`:
``` bash
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts 80
```
`vdaita/edit_5k_char` or `vdaita/edit_10k_char`:
``` bash
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path vdaita/edit_5k_char \
--num-prompts 90 \
--blazedit-min-distance 0.01 \
--blazedit-max-distance 0.99
```
#### Running With Sampling Parameters
When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:
```bash
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
--top-k 10 \
--top-p 0.9 \
--temperature 0.5 \
--num-prompts 10
```
#### Running With Ramp-Up Request Rate
The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget.
Two ramp-up strategies are supported:
- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.
The following arguments can be used to control the ramp-up:
- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
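For example, the following sketch ramps the request rate linearly from 1 to 20 requests per second, reusing the ShareGPT setup from above (the rates and prompt count are illustrative):
```bash
vllm bench serve \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 500 \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 20
```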
#### Load Pattern Configuration
vLLM's benchmark serving script provides sophisticated load pattern simulation capabilities through three key parameters that control request generation and concurrency behavior:
##### Load Pattern Control Parameters
- `--request-rate`: Controls the target request generation rate (requests per second). Set to `inf` for maximum throughput testing or finite values for controlled load simulation.
- `--burstiness`: Controls traffic variability using a Gamma distribution (range: > 0). Lower values create bursty traffic, higher values create uniform traffic.
- `--max-concurrency`: Limits concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.
These parameters work together to create realistic load patterns with carefully chosen defaults.

The `--request-rate` parameter defaults to `inf` (infinite), which sends all requests immediately for maximum throughput testing. When set to a finite value, it uses either a Poisson process (the default `--burstiness=1.0`) or a Gamma distribution for realistic request timing.

The `--burstiness` parameter only takes effect when `--request-rate` is not infinite: a value of 1.0 creates natural Poisson traffic, lower values (0.1-0.5) create bursty patterns, and higher values (2.0-5.0) create uniform spacing.

The `--max-concurrency` parameter defaults to `None` (unlimited) but can be set to simulate real-world constraints where a load balancer or API gateway limits concurrent connections.

When combined, these parameters allow you to simulate everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.
The `--burstiness` parameter mathematically controls request arrival patterns using a Gamma distribution where:
- Shape parameter: `burstiness` value
- Coefficient of Variation (CV): $\frac{1}{\sqrt{burstiness}}$
- Traffic characteristics:
- `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
- `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation
- `burstiness = 5.0`: Uniform traffic (CV ≈ 0.45) - controlled load testing

*Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*
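As an illustration, a bursty stress-test configuration along the lines of the Stress Testing row in the table below might look like this sketch (the rate, burstiness, and prompt count are illustrative):
```bash
vllm bench serve \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 50 \
    --burstiness 0.3 \
    --num-prompts 500
```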
Load Pattern Recommendations by Use Case:
| Use Case | Burstiness | Request Rate | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
| Realistic Testing | 1.0 | Moderate (5-20) | Infinite | Natural Poisson traffic patterns for baseline performance |
| Stress Testing | 0.1-0.5 | High (20-100) | Infinite | Challenging burst patterns to test resilience |
| Latency Profiling | 2.0-5.0 | Low (1-10) | Infinite | Uniform load for consistent timing analysis |
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |
These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
The **Maximum Throughput** pattern (`--request-rate=inf` together with a finite `--max-concurrency`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:
- Users send requests as fast as they can (infinite rate)
- A load balancer or API gateway controls the maximum concurrent connections
- The system operates at its concurrency limit, revealing true throughput capacity
- `--burstiness` has no effect since request timing is not controlled when rate is infinite
This pattern helps determine optimal concurrency settings for your production load balancer configuration.
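A sketch of this pattern, assuming a random dataset and an illustrative concurrency limit of 64:
```bash
vllm bench serve \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --request-rate inf \
    --max-concurrency 64 \
    --num-prompts 1000
```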
To effectively configure load patterns, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:
```text
GPU KV cache size: 15,728,640 tokens
Maximum concurrency for 8,192 tokens per request: 1920
```
Where:
- GPU KV cache size: Total tokens that can be cached across all concurrent requests
- Maximum concurrency: Theoretical maximum concurrent requests for the given `max_model_len`
- Calculation: `max_concurrency = kv_cache_size / max_model_len`
Using KV cache metrics for load pattern configuration:
- For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
- For SLA Validation: Use the reported maximum as your SLA limit to ensure compliance testing matches production capacity
- For Realistic Testing: Monitor memory usage when approaching theoretical limits to understand sustainable request rates
- Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths
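For instance, using the startup log above (maximum concurrency of 1920 at 8,192 tokens per request), a capacity-planning run at roughly 90% of the theoretical maximum might look like this sketch (the dataset and prompt count are illustrative):
```bash
# 0.9 * 1920 ≈ 1728 concurrent requests
vllm bench serve \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --max-concurrency 1728 \
    --num-prompts 2000
```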
### 📈 Offline Throughput Benchmark
```bash
vllm bench throughput \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
--num-prompts 10
```
If successful, you will see the following output:
```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014
Total num output tokens: 1500
```
#### VisionArena Benchmark for Vision Language Models
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 1000 \
--hf-split train
```
The `num prompt tokens` now includes image token counts:
```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527
Total num output tokens: 1280
```
#### InstructCoder Benchmark with Speculative Decoding
``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
--input-len=1000 \
--output-len=100 \
--num-prompts=2048 \
--async-engine \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
```
```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136
Total num output tokens: 204800
```
#### Other HuggingFaceDataset Examples
`lmms-lab/LLaVA-OneVision-Data`:
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
`Aeala/ShareGPT_Vicuna_unfiltered`:
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
`AI-MO/aimo-validation-aime`:
```bash
vllm bench throughput \
--model Qwen/QwQ-32B \
--backend vllm \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
--hf-split train \
--num-prompts 10
```
Benchmark with LoRA adapters:
``` bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \
--model meta-llama/Llama-2-7b-hf \
--backend vllm \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset-name sharegpt \
--num-prompts 10 \
--max-loras 2 \
--max-lora-rank 8 \
--enable-lora \
--lora-path yard1/llama-2-7b-sql-lora-test
```
### 🛠️ Structured Output Benchmark
Benchmark the performance of structured output generation (JSON, grammar, regex).
#### Server Setup
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
#### JSON Schema Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset json \
--structured-output-ratio 1.0 \
--request-rate 10 \
--num-prompts 1000
```
#### Grammar-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset grammar \
--structure-type grammar \
--request-rate 10 \
--num-prompts 1000
```
#### Regex-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset regex \
--request-rate 10 \
--num-prompts 1000
```
#### Choice-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset choice \
--request-rate 10 \
--num-prompts 1000
```
#### XGrammar Benchmark Dataset
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset xgrammar_bench \
--request-rate 10 \
--num-prompts 1000
```
### 📚 Long Document QA Benchmark
Benchmark the performance of long document question-answering with prefix caching.
#### Basic Long Document QA Test
```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 16 \
--document-length 2000 \
--output-len 50 \
--repeat-count 5
```
#### Different Repeat Modes
```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode random
# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode tile
# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode interleave
```
### 🗂️ Prefix Caching Benchmark
Benchmark the efficiency of automatic prefix caching.
#### Fixed Prompt with Prefix Caching
```bash
python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
```
#### ShareGPT Dataset with Prefix Caching
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
--enable-prefix-caching \
--num-prompts 20 \
--repeat-count 5 \
--input-length-range 128:256
```
##### Prefix Repetition Dataset
```bash
vllm bench serve \
--backend openai \
--model meta-llama/Llama-2-7b-chat-hf \
--dataset-name prefix_repetition \
--num-prompts 100 \
--prefix-repetition-prefix-len 512 \
--prefix-repetition-suffix-len 128 \
--prefix-repetition-num-prefixes 5 \
--prefix-repetition-output-len 128
```
### 🧪 Hashing Benchmarks
Two helper scripts live in `benchmarks/` to compare hashing options used by prefix caching and related utilities. They are standalone (no server required) and help choose a hash algorithm before enabling prefix caching in production.
- `benchmarks/benchmark_hash.py`: Micro-benchmark that measures per-call latency of three implementations on a representative `(bytes, tuple[int])` payload.
```bash
python benchmarks/benchmark_hash.py --iterations 20000 --seed 42
```
- `benchmarks/benchmark_prefix_block_hash.py`: End-to-end block hashing benchmark that runs the full prefix-cache hash pipeline (`hash_block_tokens`) across many fake blocks and reports throughput.
```bash
python benchmarks/benchmark_prefix_block_hash.py --num-blocks 20000 --block-size 32 --trials 5
```
Supported algorithms: `sha256`, `sha256_cbor`, `xxhash`, `xxhash_cbor`. Install optional deps to exercise all variants:
```bash
uv pip install xxhash cbor2
```
If an algorithm’s dependency is missing, the script will skip it and continue.
### ⚡ Request Prioritization Benchmark
Benchmark the performance of request prioritization in vLLM.
#### Basic Prioritization Test
```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority
```
#### Multiple Sequences per Prompt
```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority \
--n 2
```
### 👁️ Multi-Modal Benchmark
Benchmark the performance of multi-modal requests in vLLM.
#### Images (ShareGPT4V)
Start vLLM:
```bash
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--allowed-local-media-path /path/to/sharegpt4v/images
```
Send requests with images:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--dataset-name sharegpt \
--dataset-path /path/to/ShareGPT4V/sharegpt4v_instruct_gpt4-vision_cap100k.json \
--num-prompts 100 \
--save-result \
--result-dir ~/vllm_benchmark_results \
--save-detailed \
--endpoint /v1/chat/completions
```
#### Videos (ShareGPT4Video)
Start vLLM:
```bash
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"video": 1}' \
--allowed-local-media-path /path/to/sharegpt4video/videos
```
Send requests with videos:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--dataset-name sharegpt \
--dataset-path /path/to/ShareGPT4Video/llava_v1_5_mix665k_with_video_chatgpt72k_share4video28k.json \
--num-prompts 100 \
--save-result \
--result-dir ~/vllm_benchmark_results \
--save-detailed \
--endpoint /v1/chat/completions
```
#### Synthetic Random Images (random-mm)
Generate synthetic image inputs alongside random text prompts to stress-test vision models without external datasets.
Notes:
- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.
Start the server (example):
```bash
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--limit-mm-per-prompt '{"image": 3, "video": 0}' \
--mm-processor-kwargs max_pixels=1003520
```
Run the benchmark. It is recommended to use the `--ignore-eos` flag to simulate real responses. You can set the output size via `--random-output-len`.
Example 1: a fixed number of items and a single image resolution, enforcing generation of approximately 40 tokens:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--num-prompts 100 \
--max-concurrency 10 \
--random-prefix-len 25 \
--random-input-len 300 \
--random-output-len 40 \
--random-range-ratio 0.2 \
--random-mm-base-items-per-request 2 \
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--request-rate inf \
--ignore-eos \
--seed 42
```
The number of items per request can be controlled by passing multiple image buckets:
```bash
--random-mm-base-items-per-request 2 \
--random-mm-num-mm-items-range-ratio 0.5 \
--random-mm-limit-mm-per-prompt '{"image": 4, "video": 0}' \
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}' \
```
Flags specific to `random-mm`:
- `--random-mm-base-items-per-request`: base number of multimodal items per request.
- `--random-mm-num-mm-items-range-ratio`: vary item count uniformly in the closed integer range [floor(n·(1−r)), ceil(n·(1+r))]. Set r=0 to keep it fixed; r=1 allows 0 items.
- `--random-mm-limit-mm-per-prompt`: per-modality hard caps, e.g. '{"image": 3, "video": 0}'.
- `--random-mm-bucket-config`: dict mapping (H, W, T) → probability. Entries with probability 0 are removed; remaining probabilities are renormalized to sum to 1. Use T=1 for images. Set any T>1 for videos (video sampling not yet supported).
Behavioral notes:
- If the requested base item count cannot be satisfied under the provided per-prompt limits, the tool raises an error rather than silently clamping.
How sampling works:
- Determine per-request item count k by sampling uniformly from the integer range defined by `--random-mm-base-items-per-request` and `--random-mm-num-mm-items-range-ratio`, then clamp k to at most the sum of per-modality limits.
- For each of the k items, sample a bucket (H, W, T) according to the normalized probabilities in `--random-mm-bucket-config`, while tracking how many items of each modality have been added.
- If a modality (e.g., image) reaches its limit from `--random-mm-limit-mm-per-prompt`, all buckets of that modality are excluded and the remaining bucket probabilities are renormalized before continuing.
This should be seen as an edge case; it can be avoided by setting `--random-mm-limit-mm-per-prompt` to a large number. Note that doing so might result in errors due to the engine config `--limit-mm-per-prompt`.
- The resulting request contains synthetic image data in `multi_modal_data` (OpenAI Chat format). When `random-mm` is used with the OpenAI Chat backend, prompts remain text and MM content is attached via `multi_modal_data`.
### Embedding Benchmark
Benchmark the performance of embedding requests in vLLM.
#### Text Embeddings
Unlike generative models, which use the Completions or Chat Completions API,
you should set `--backend openai-embeddings` and `--endpoint /v1/embeddings` to use the Embeddings API.
You can use any text dataset to benchmark the model, such as ShareGPT.
Start the server:
```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```
Run the benchmark:
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model jinaai/jina-embeddings-v3 \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json
```
#### Multi-modal Embeddings
Unlike generative models, which use the Completions or Chat Completions API,
you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backend to use depends on the model:
- CLIP: `--backend openai-embeddings-clip`
- VLM2Vec: `--backend openai-embeddings-vlm2vec`
For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
Serve and benchmark CLIP:
```bash
# Run this in another process
vllm serve openai/clip-vit-base-patch32
# Run these one by one after the server is up
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model openai/clip-vit-base-patch32 \
--backend openai-embeddings-clip \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model openai/clip-vit-base-patch32 \
--backend openai-embeddings-clip \
--endpoint /v1/embeddings \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat
```
Serve and benchmark VLM2Vec:
```bash
# Run this in another process
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--chat-template examples/template_vlm2vec_phi3v.jinja
# Run these one by one after the server is up
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model TIGER-Lab/VLM2Vec-Full \
--backend openai-embeddings-vlm2vec \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model TIGER-Lab/VLM2Vec-Full \
--backend openai-embeddings-vlm2vec \
--endpoint /v1/embeddings \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat
```
### Reranker Benchmark
Benchmark the performance of rerank requests in vLLM.
Unlike generative models, which use the Completions or Chat Completions API,
you should set `--backend vllm-rerank` and `--endpoint /v1/rerank` to use the Reranker API.
For reranking, the only supported dataset is `--dataset-name random-rerank`.
Start the server:
```bash
vllm serve BAAI/bge-reranker-v2-m3
```
Run the benchmark:
```bash
vllm bench serve \
--model BAAI/bge-reranker-v2-m3 \
--backend vllm-rerank \
--endpoint /v1/rerank \
--dataset-name random-rerank \
--tokenizer BAAI/bge-reranker-v2-m3 \
--random-input-len 512 \
--num-prompts 10 \
--random-batch-size 5
```
For reranker models, this will create `num_prompts / random_batch_size` requests with
`random_batch_size` "documents" where each one has close to `random_input_len` tokens.
In the example above, this results in 2 rerank requests with 5 "documents" each where
each document has close to 512 tokens.
Note that `/v1/rerank` is also supported by embedding models, so if you're running
with an embedding model, also set `--no_reranker`. Because in this case the query is
treated as an individual prompt by the server, we send `random_batch_size - 1` documents
to account for the extra prompt, which is the query. The token accounting used to report
throughput numbers is adjusted accordingly.
---
# Performance Dashboard
The performance dashboard is used to confirm whether new changes improve/degrade performance under various workloads.
It is updated by triggering benchmark runs on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
The results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
## Manually Trigger the benchmark
Use the [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with the vLLM benchmark suite.
For an x86 CPU environment, use the image with the "-cpu" suffix. For an AArch64 CPU environment, use the image with the "-arm64-cpu" suffix.
Here is an example `docker run` command for CPU. For GPUs, skip setting the `ON_CPU` environment variable.
```bash
export VLLM_COMMIT=1da94e673c257373280026f75ceb4effac80e892 # use full commit hash from the main branch
export HF_TOKEN=
if [[ "$(uname -m)" == aarch64 || "$(uname -m)" == arm64 ]]; then
IMG_SUFFIX="arm64-cpu"
else
IMG_SUFFIX="cpu"
fi
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN=$HF_TOKEN -e ON_ARM64_CPU=1 --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}-${IMG_SUFFIX}
```
Then, run the command below inside the Docker container.
```bash
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
### Runtime environment variables
- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
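For example, a sketch that runs the serving tests against a remote vLLM instance from a CPU machine; the JSON path, host, and port are placeholders:
```bash
# Values below are placeholders; adjust them to your environment.
export ON_CPU=1
export SERVING_JSON=/workspace/my_serving_tests.json
export REMOTE_HOST=10.0.0.42
export REMOTE_PORT=8000
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```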
### Visualization
The `convert-results-json-to-markdown.py` script renders the benchmarking results as a markdown table.
You can find the resulting table on the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The JSON version of the table (together with the JSON version of the benchmark) is also attached to the markdown file.
The raw benchmarking results (JSON files) are available in the `Artifacts` tab of the benchmarking job.
#### Performance Results Comparison
The `compare-json-results.py` script compares benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
`compare-json-results.py` compares two `benchmark_results.json` files and reports performance ratios, e.g. for Output Tput, Median TTFT, and Median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` compares different TP and PP configurations within that file instead.
Here is an example using the script to compare `results_a` and `results_b` by max concurrency and QPS for the same model, dataset name, and input/output lengths:
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]***
| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|------|-----|-----------|----------|----------|
| 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
| 1 | 16 | inf| 25.49 | 246.92 | 9.69 |
| 2 | 24 | inf| 27.74 | 293.34 | 10.57 |
| 3 | 32 | inf| 28.61 |306.69 | 10.72 |
***compare-json-results.py – Command-Line Parameters***
`compare-json-results.py` provides configurable parameters to compare one or more `benchmark_results.json` files and generate summary tables and plots.
In most cases, users only need to specify `--file` to parse the desired benchmark results.
| Parameter | Type | Default Value | Description |
| ---------------------- | ------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------- |
| `--file` | `str` (appendable) | *None* | Input JSON result file(s). Can be specified multiple times to compare multiple benchmark outputs. |
| `--debug` | `bool` | `False` | Enables debug mode. When set, prints all available information to aid troubleshooting and validation. |
| `--plot` / `--no-plot` | `bool` | `True` | Controls whether performance plots are generated. Use `--no-plot` to disable graph generation. |
| `--xaxis` | `str` | `# of max concurrency.` | Column name used as the X-axis in comparison plots (for example, concurrency or batch size). |
| `--latency` | `str` | `p99` | Latency aggregation method used for TTFT/TPOT. Supported values: `median` or `p99`. |
| `--ttft-max-ms` | `float` | `3000.0` | Reference upper bound (milliseconds) for TTFT plots, typically used to visualize SLA thresholds. |
| `--tpot-max-ms` | `float` | `100.0` | Reference upper bound (milliseconds) for TPOT plots, typically used to visualize SLA thresholds. |
***Valid Max Concurrency Summary***
Based on the configured TTFT and TPOT SLA thresholds, compare-json-results.py computes the maximum valid concurrency for each benchmark result.
The “Max # of max concurrency. (Both)” column represents the highest concurrency level that satisfies both TTFT and TPOT constraints simultaneously.
This value is typically used in capacity planning and sizing guides.
| # | Configuration | Max # of max concurrency. (TTFT ≤ 10000 ms) | Max # of max concurrency. (TPOT ≤ 100 ms) | Max # of max concurrency. (Both) | Output Tput @ Both (tok/s) | TTFT @ Both (ms) | TPOT @ Both (ms) |
| - | -------------- | ------------------------------------------- | ----------------------------------------- | -------------------------------- | -------------------------- | ---------------- | ---------------- |
| 0 | results-a | 128.00 | 12.00 | 12.00 | 127.76 | 3000.82 | 93.24 |
| 1 | results-b | 128.00 | 32.00 | 32.00 | 371.42 | 2261.53 | 81.74 |
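A summary like the one above could be produced by a command along these lines (the thresholds match the table headers; the file names are illustrative):
```bash
python3 compare-json-results.py \
    --file results-a/benchmark_results.json \
    --file results-b/benchmark_results.json \
    --latency p99 \
    --ttft-max-ms 10000 \
    --tpot-max-ms 100 \
    --no-plot
```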
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).
## Continuous Benchmarking
The continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.
### How It Works
The continuous benchmarking is triggered via a [GitHub workflow CI](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:
- **Serving tests**: Measure request handling and API performance
- **Throughput tests**: Evaluate token generation rates
- **Latency tests**: Assess response time characteristics
### Benchmark Configuration
The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:
1. Navigate to the appropriate GPU directory in the benchmarks configuration
2. Add your model specifications to the corresponding configuration files
3. The new models will be included in the next scheduled benchmark run
---
# Parameter Sweeps
## Online Benchmark
### Basic
`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
Follow these steps to run the script:
1. Construct the base command to `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command to `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
```json
[
{
"max_num_seqs": 32,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 4096
},
{
"max_num_seqs": 256,
"max_num_batched_tokens": 4096
}
]
```
4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
- Example: Using different input/output lengths for random dataset:
```json
[
{
"random_input_len": 128,
"random_output_len": 32
},
{
"random_input_len": 256,
"random_output_len": 64
},
{
"random_input_len": 512,
"random_output_len": 128
}
]
```
5. Determine where you want to save the results, and pass that to `--output-dir`.
Example command:
```bash
vllm bench sweep serve \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
```
!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
You can use `--dry-run` to preview the commands to be run.
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
### SLA auto-tuner
`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
For example, to ensure E2E latency within different target values for 99% of requests:
```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
Example command:
```bash
vllm bench sweep serve_sla \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
-o benchmarks/results
```
The algorithm for adjusting the SLA variable is as follows:
1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
- For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
## Visualization
### Basic
`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
Example command:
```bash
vllm bench sweep plot benchmarks/results/ \
--var-x max_concurrency \
--row-by random_input_len \
--col-by random_output_len \
--curve-by api_server_count,max_num_batched_tokens \
--filter-by 'max_concurrency<=1024'
```
!!! tip
You can use `--dry-run` to preview the figures to be plotted.
### Pareto chart
`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.
Higher concurrency or batch size can raise GPU efficiency (per-GPU throughput) but adds per-user latency; lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.
- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise the GPU count is computed as TP × PP × DP).
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
- Use `--label-by` to show the configuration used for each data point (default: `max_concurrency,gpu_count`).
Example:
```bash
vllm bench sweep plot_pareto benchmarks/results/ \
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```
---
# vllm bench latency
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_latency.inc.md"
---
# vllm bench serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_serve.inc.md"
---
# vllm bench sweep plot
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_sweep_plot.inc.md"
---
# vllm bench sweep plot_pareto
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_sweep_plot_pareto.inc.md"
---
# vllm bench sweep serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_sweep_serve.inc.md"
---
# vllm bench sweep serve_sla
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_sweep_serve_sla.inc.md"
---
# vllm bench throughput
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_throughput.inc.md"
---
# vllm chat
## Arguments
--8<-- "docs/generated/argparse/chat.inc.md"
---
# vllm complete
## Arguments
--8<-- "docs/generated/argparse/complete.inc.md"
---
When passing JSON CLI arguments, the following sets of arguments are equivalent:
- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
Additionally, list elements can be passed individually using `+`:
- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
---
# vllm run-batch
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/run-batch.inc.md"
---
# vllm serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/serve.inc.md"
---
# Meetups
We host regular meetups around the world, where we share project updates from the vLLM team and invite guest speakers from the industry to share their experience and insights.
Please visit [vllm.ai/events](https://vllm.ai/events) to learn more.
---
# Sponsors
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
Please visit [vllm.ai/#sponsors](https://vllm.ai/#sponsors) to learn more.
---
# Conserving Memory
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
## Tensor Parallelism (TP)
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
The following code splits the model across 2 GPUs.
```python
from vllm import LLM
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```
!!! warning
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
!!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Quantization
Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here](../features/quantization/README.md) for more details.
## Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
```python
from vllm import LLM
llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)
```
## Reduce CUDA Graphs
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
??? code
```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationMode
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
mode=CompilationMode.VLLM_COMPILE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```
You can disable graph capturing completely via the `enforce_eager` flag:
```python
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
```
## Adjust cache size
If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of the multi-modal processor cache via the `mm_processor_cache_gb` engine argument (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache via the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
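For example, a sketch that applies both knobs when serving a multi-modal model on the CPU backend; the `--mm-processor-cache-gb` flag form is assumed from the `mm_processor_cache_gb` engine argument, and the sizes are illustrative:
```bash
# CPU backend only: allow 8 GiB of KV cache in CPU RAM (default is 4 GiB).
export VLLM_CPU_KVCACHE_SPACE=8
# Shrink the multi-modal processor cache from the default 4 GiB to 2 GiB
# (flag name assumed from the engine argument).
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --mm-processor-cache-gb 2
```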
## Multi-modal input limits
You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
```python
from vllm import LLM
# Accept up to 3 images and 1 video per prompt
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={"image": 3, "video": 1},
)
```
You can go a step further and disable unused modalities completely by setting their limit to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
```python
from vllm import LLM
# Accept any number of images but no videos
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={"video": 0},
)
```
You can even run a multi-modal model for text-only inference:
```python
from vllm import LLM
# Don't accept images. Just text.
llm = LLM(
model="google/gemma-3-27b-it",
limit_mm_per_prompt={"image": 0},
)
```
### Configurable options
`limit_mm_per_prompt` also accepts configurable options per modality. In the configurable form, you still specify `count`, and you may optionally provide size hints that control how vLLM profiles and reserves memory for your multi‑modal inputs. This helps you tune memory for the actual media you expect, instead of the model’s absolute maxima.
Configurable options by modality:
- `image`: `{"count": int, "width": int, "height": int}`
- `video`: `{"count": int, "num_frames": int, "width": int, "height": int}`
- `audio`: `{"count": int, "length": int}`
Details can be found in [`ImageDummyOptions`][vllm.config.multimodal.ImageDummyOptions], [`VideoDummyOptions`][vllm.config.multimodal.VideoDummyOptions], and [`AudioDummyOptions`][vllm.config.multimodal.AudioDummyOptions].
Examples:
```python
from vllm import LLM
# Up to 5 images per prompt, profile with 512x512.
# Up to 1 video per prompt, profile with 32 frames at 640x640.
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={
"image": {"count": 5, "width": 512, "height": 512},
"video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
},
)
```
For backward compatibility, passing an integer works as before and is interpreted as `{"count": <n>}`. For example:
- `limit_mm_per_prompt={"image": 5}` is equivalent to `limit_mm_per_prompt={"image": {"count": 5}}`
- You can mix formats: `limit_mm_per_prompt={"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}`
!!! note
- The size hints affect memory profiling only. They shape the dummy inputs used to compute reserved activation sizes. They do not change how inputs are actually processed at inference time.
- If a hint exceeds what the model can accept, vLLM clamps it to the model's effective maximum and may log a warning.
!!! warning
These size hints currently only affect activation memory profiling. Encoder cache size is determined by the actual inputs at runtime and is not limited by these hints.
## Multi-modal processor arguments
For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples:
```python
from vllm import LLM
# Available for Qwen2-VL series models
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={"max_pixels": 768 * 768}, # Default is 1280 * 28 * 28
)
# Available for InternVL series models
llm = LLM(
model="OpenGVLab/InternVL2-2B",
mm_processor_kwargs={"max_dynamic_patch": 4}, # Default is 12
)
```
---
---
toc_depth: 3
---
# Engine Arguments
Engine arguments control the behavior of the vLLM engine.
- For [offline inference](../serving/offline_inference.md), they are part of the arguments to [LLM][vllm.LLM] class.
- For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.
The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs], are a combination of the configuration classes defined in [vllm.config][]. Therefore, if you are interested in developer documentation, we recommend looking at these configuration classes as they are the source of truth for types, defaults and docstrings.
--8<-- "docs/cli/json_tip.inc.md"
## `EngineArgs`
--8<-- "docs/generated/argparse/engine_args.inc.md"
## `AsyncEngineArgs`
--8<-- "docs/generated/argparse/async_engine_args.inc.md"
---
# Environment Variables
vLLM uses the following environment variables to configure the system:
!!! warning
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```python
--8<-- "vllm/envs.py:env-vars-definition"
```
---
# Model Resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
from vllm import LLM
llm = LLM(
model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
)
```
Our [list of supported models](../models/supported_models.md) shows the model architectures that are recognized by vLLM.
---
# Optimization and Tuning
This guide covers optimization strategies and performance tuning for vLLM V1.
!!! tip
Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.
## Preemption
Due to the autoregressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions, consider the following actions:
- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
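As a hedged starting point when preemptions are frequent (the model and values below are illustrative and should be tuned for your workload):
```python
from vllm import LLM

# A minimal sketch: give the KV cache more headroom to reduce preemptions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,  # reserve more GPU memory for KV cache (default is 0.9)
    max_num_seqs=128,             # fewer concurrent sequences -> less KV cache pressure
)
```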
## Chunked Prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
This policy has two benefits:
- It improves inter-token latency (ITL) during generation because decode requests are prioritized.
- It helps achieve better GPU utilization by colocating compute-bound (prefill) and memory-bound (decode) requests in the same batch.
### Performance Tuning with Chunked Prefill
You can tune the performance by adjusting `max_num_batched_tokens`:
- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
```python
from vllm import LLM
# Set max_num_batched_tokens to tune performance
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=16384)
```
See the related papers on chunked prefill for more details.
## Parallelism Strategies
vLLM supports multiple parallelism strategies that can be combined to optimize performance across different hardware configurations.
### Tensor Parallelism (TP)
Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large model inference within a single node.
**When to use:**
- When the model is too large to fit on a single GPU
- When you need to reduce memory pressure per GPU to allow more KV cache space for higher throughput
```python
from vllm import LLM
# Split model across 4 GPUs
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
```
For models that are too large to fit on a single GPU (like 70B parameter models), tensor parallelism is essential.
### Pipeline Parallelism (PP)
Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence.
**When to use:**
- When you've already maxed out efficient tensor parallelism but need to distribute the model further, or across nodes
- For very deep and narrow models where layer distribution is more efficient than tensor sharding
Pipeline parallelism can be combined with tensor parallelism for very large models:
```python
from vllm import LLM
# Combine pipeline and tensor parallelism
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
tensor_parallel_size=4,
pipeline_parallel_size=2,
)
```
### Expert Parallelism (EP)
Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.
**When to use:**
- Specifically for MoE models (like DeepSeekV3, Qwen3MoE, Llama-4)
- When you want to balance the expert computation load across GPUs
Expert parallelism is enabled by setting `enable_expert_parallel=True`, which uses expert parallelism instead of tensor parallelism for MoE layers.
It uses the same degree of parallelism as the tensor parallel size you have set.
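For example, a minimal sketch (the MoE model name here is illustrative):
```python
from vllm import LLM

# A minimal sketch: MoE layers use expert parallelism across the same 4 GPUs
# that are used for tensor parallelism.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
)
```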
### Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
**When to use:**
- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
### Batch-level DP for Multi-Modal Encoders
By default, TP is used to shard the weights of multi-modal encoders just like for language decoders,
in order to reduce the memory and compute load on each GPU.
However, since the size of multi-modal encoders is very small compared to language decoders,
there is relatively little gain from TP. On the other hand, TP incurs significant communication
overhead because of all-reduce being performed after every layer.
Given this, it may be advantageous to instead shard the batched input data using TP, essentially
performing batch-level DP. This has been shown to improve the throughput and TTFT by around 10% for
`tensor_parallel_size=8`. For vision encoders that use hardware-unoptimized Conv3D operations,
batch-level DP can provide another 40% improvement compared to regular TP.
Nevertheless, since the weights of the multi-modal encoder are replicated across each TP rank,
there will be a minor increase in memory consumption and may cause OOM if you can barely fit the model already.
You can enable batch-level DP by setting `mm_encoder_tp_mode="data"`, for example:
```python
from vllm import LLM
llm = LLM(
model="Qwen/Qwen2.5-VL-72B-Instruct",
tensor_parallel_size=4,
# When mm_encoder_tp_mode="data",
# the vision encoder uses TP=4 (not DP=1) to shard the input data,
# so the TP size becomes the effective DP size.
# Note that this is independent of the DP size for language decoder which is used in expert parallel setting.
mm_encoder_tp_mode="data",
# The language decoder uses TP=4 to shard the weights regardless
# of the setting of mm_encoder_tp_mode
)
```
!!! important
Batch-level DP is not to be confused with API request-level DP
(which is instead controlled by `data_parallel_size`).
Batch-level DP needs to be implemented on a per-model basis,
and enabled by setting `supports_encoder_tp_data = True` in the model class.
Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to use this feature.
Known supported models:
- dots_ocr
- GLM-4.1V or above
- InternVL
- Kimi-VL
- Llama4
- MiniCPM-V-2.5 or above
- Qwen2-VL or above
- Step3
## Input Processing
### Parallel Processing
You can run input processing in parallel via [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
This is useful when input processing (which is run inside the API server)
becomes a bottleneck compared to model execution (which is run inside engine core)
and you have excess CPU capacity.
```console
# Run 4 API processes and 1 engine core process
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4
# Run 4 API processes and 2 engine core processes
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
```
!!! note
API server scale-out is only available for online inference.
!!! warning
By default, 8 CPU threads are used in each API server to load media items (e.g. images)
from request data.
If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
to avoid CPU resource exhaustion.
!!! note
API server scale-out disables [multi-modal IPC caching](#ipc-caching)
because it requires a one-to-one correspondence between API and engine core processes.
This does not impact [multi-modal processor caching](#processor-caching).
## Multi-Modal Caching
Multi-modal caching avoids repeated transfer or processing of the same multi-modal data,
which commonly occurs in multi-turn conversations.
### Processor Caching
Multi-modal processor caching is automatically enabled
to avoid repeatedly processing the same multi-modal inputs in `BaseMultiModalProcessor`.
### IPC Caching
Multi-modal IPC caching is automatically enabled when
there is a one-to-one correspondence between API (`P0`) and engine core (`P1`) processes,
to avoid repeatedly transferring the same multi-modal inputs between them.
#### Key-Replicated Cache
By default, IPC caching uses a **key-replicated cache**, where cache keys exist
in both the API (`P0`) and engine core (`P1`) processes, but the actual cache
data resides only in `P1`.
#### Shared Memory Cache
When multiple worker processes are involved (e.g., when TP > 1), a
**shared-memory cache** is more efficient. This can be enabled by setting
`mm_processor_cache_type="shm"`. In this mode, cache keys are stored
on `P0`, while the cache data itself lives in shared memory accessible by all
processes.
### Configuration
You can adjust the size of the cache by setting the value of `mm_processor_cache_gb` (default 4 GiB).
If you do not benefit much from the cache, you can disable both IPC
and processor caching completely via `mm_processor_cache_gb=0`.
Examples:
```python
# Use a larger cache
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_cache_gb=8,
)
# Use a shared-memory based IPC cache
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
tensor_parallel_size=2,
mm_processor_cache_type="shm",
mm_processor_cache_gb=8,
)
# Disable the cache
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_cache_gb=0,
)
```
### Cache Placement
Based on the configuration, the contents of the multi-modal caches on `P0` and `P1` are as follows:
| mm_processor_cache_type | Cache Type | `P0` Cache | `P1` Engine Cache | `P1` Worker Cache | Max. Memory |
|-------------------|-------------|------------|------------|-------------|-------------|
| lru | Processor Caching | K + V | N/A | N/A | `mm_processor_cache_gb * data_parallel_size` |
| lru | Key-Replicated Caching | K | K + V | N/A | `mm_processor_cache_gb * api_server_count` |
| shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
| N/A | Disabled | N/A | N/A | N/A | `0` |
K: Stores the hashes of multi-modal items
V: Stores the processed tensor data of multi-modal items
---
# Server Arguments
The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments
To see the available options, take a look at the [CLI Reference](../cli/README.md)!
## Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above](serve_args.md).
For example:
```yaml
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```
To use the above config file:
```bash
vllm serve --config config.yaml
```
!!! note
If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
The order of priority is `command line > config file values > defaults`.
For example, with `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over the `model` entry in the config file.
---
# CI Failures
What should I do when a CI job fails on my PR, but I don't think my PR caused
the failure?
- Check the dashboard of current CI test failures:
👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)
- If your failure **is already listed**, it's likely unrelated to your PR.
Help fixing it is always welcome!
- Leave comments with links to additional instances of the failure.
- React with a 👍 to signal how many are affected.
- If your failure **is not listed**, you should **file an issue**.
## Filing a CI Test Failure Issue
- **File a bug report:**
👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)
- **Use this title format:**
```text
[CI Failure]: failing-test-job - regex/matching/failing:test
```
- **For the environment field:**
```text
Still failing on main as of commit abcdef123
```
- **In the description, include failing tests:**
```text
FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description
https://github.com/orgs/vllm-project/projects/20
https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml
FAILED failing/test.py:failing_test3 - Failure description
```
- **Attach logs** (collapsible section example):
Logs:
```text
ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model
return self.model_executor.execute_model(scheduler_output)
...
FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description
FAILED failing/test.py:failing_test3 - Failure description
```
## Logs Wrangling
Download the full log file from Buildkite locally.
Strip timestamps and colorization:
[.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
```bash
./ci-clean-log.sh ci.log
```
Use a tool like [wl-clipboard](https://github.com/bugaevc/wl-clipboard) for quick copy-pasting:
```bash
tail -525 ci_build.log | wl-copy
```
## Investigating a CI Test Failure
1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)
2. Bisect to find the first build that shows the issue.
3. Add your findings to the GitHub issue.
4. If you find a strong candidate PR, mention it in the issue and ping contributors.
## Reproducing a Failure
CI test failures may be flaky. Use a bash loop to run repeatedly:
[.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
```bash
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]
```
## Submitting a PR
If you submit a PR to fix a CI failure:
- Link the PR to the issue:
Add `Closes #12345` to the PR description.
- Add the `ci-failure` label:
This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).
## Other Resources
- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability)
- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main)
## Daily Triage
Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to:
- Identify recent test failures **on `main`**.
- Exclude legitimate test failures on PRs.
- (Optional) Ignore tests with 0% reliability.
Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20).
---
# Nightly Builds of vLLM Wheels
vLLM maintains a per-commit wheel repository (commonly referred to as "nightly") at `https://wheels.vllm.ai` that provides pre-built wheels for every commit on the `main` branch since `v0.5.3`. This document explains how the nightly wheel index mechanism works.
## Build and Upload Process on CI
### Wheel Building
Wheels are built in the `Release` pipeline (`.buildkite/release-pipeline.yaml`) after a PR is merged into the main branch, with multiple variants:
- **Backend variants**: `cpu` and `cuXXX` (e.g., `cu129`, `cu130`).
- **Architecture variants**: `x86_64` and `aarch64`.
Each build step:
1. Builds the wheel in a Docker container.
2. Renames the wheel filename to use the correct manylinux tag (currently `manylinux_2_31`) for PEP 600 compliance.
3. Uploads the wheel to S3 bucket `vllm-wheels` under `/{commit_hash}/`.
### Index Generation
After uploading each wheel, the `.buildkite/scripts/upload-wheels.sh` script:
1. **Lists all existing wheels** in the commit directory from S3
2. **Generates indices** using `.buildkite/scripts/generate-nightly-index.py`:
- Parses wheel filenames to extract metadata (version, variant, platform tags).
- Creates HTML index files (`index.html`) for PyPI compatibility.
- Generates machine-readable `metadata.json` files.
3. **Uploads indices** to multiple locations (overwriting existing ones):
- `/{commit_hash}/` - Always uploaded for commit-specific access.
- `/nightly/` - Only for commits on `main` branch (not PRs).
- `/{version}/` - Only for release wheels (no `dev` in its version).
!!! tip "Handling Concurrent Builds"
The index generation script can handle multiple variants being built concurrently by always listing all wheels in the commit directory before generating indices, avoiding race conditions.
## Directory Structure
The S3 bucket structure follows this pattern:
```text
s3://vllm-wheels/
├── {commit_hash}/ # Commit-specific wheels and indices
│ ├── vllm-*.whl # All wheel files
│ ├── index.html # Project list (default variant)
│ ├── vllm/
│ │ ├── index.html # Package index (default variant)
│ │ └── metadata.json # Metadata (default variant)
│ ├── cu129/ # Variant subdirectory
│ │ ├── index.html # Project list (cu129 variant)
│ │ └── vllm/
│ │ ├── index.html # Package index (cu129 variant)
│ │ └── metadata.json # Metadata (cu129 variant)
│ ├── cu130/ # Variant subdirectory
│ ├── cpu/ # Variant subdirectory
│ └── .../ # More variant subdirectories
├── nightly/ # Latest main branch wheels (mirror of latest commit)
└── {version}/ # Release version indices (e.g., 0.11.2)
```
All built wheels are stored in `/{commit_hash}/`, while different indices are generated and reference them.
This avoids duplication of wheel files.
For example, you can specify the following URLs to use different indices:
- `https://wheels.vllm.ai/nightly/cu130` for the latest main branch wheels built with CUDA 13.0.
- `https://wheels.vllm.ai/{commit_hash}` for wheels built at a specific commit (default variant).
- `https://wheels.vllm.ai/0.12.0/cpu` for 0.12.0 release wheels built for CPU variant.
Please note that not all variants are present on every commit. The available variants are subject to change over time, e.g., changing cu130 to cu131.
### Variant Organization
Indices are organized by variant:
- **Default variant**: Wheels without variant suffix (i.e., built with the current `VLLM_MAIN_CUDA_VERSION`) are placed in the root.
- **Variant subdirectories**: Wheels with variant suffixes (e.g., `+cu130`, `.cpu`) are organized in subdirectories.
- **Alias to default**: The default variant can have an alias (e.g., `cu129` for now) for consistency and convenience.
The variant is extracted from the wheel filename (as described in the [file name convention](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-name-convention)):
- The variant is encoded in the local version identifier (e.g. `+cu129` or `dev+g.cu130`).
- Examples:
- `vllm-0.11.2.dev278+gdbc3d9991-cp38-abi3-manylinux1_x86_64.whl` → default variant
- `vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl` → `cu129` variant
- `vllm-0.11.1rc8.dev14+gaa384b3c0.cu130-cp38-abi3-manylinux1_x86_64.whl` → `cu130` variant
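For illustration only (the actual parsing lives in `generate-nightly-index.py`, described below), variant extraction along these lines can be sketched as:
```python
import re

# A rough sketch, not the actual generate-nightly-index.py implementation:
# extract the variant (e.g. cu129, cu130, cpu) from a wheel's local version.
WHEEL_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)-(?P<py>[^-]+)-(?P<abi>[^-]+)-(?P<plat>.+)\.whl$"
)

def guess_variant(filename: str) -> str:
    match = WHEEL_RE.match(filename)
    if match is None:
        raise ValueError(f"not a wheel filename: {filename}")
    version = match["version"]
    # The variant appears as a trailing "+cuXXX"/"+cpu" or ".cuXXX"/".cpu" segment.
    variant = re.search(r"[+.](cu\d+|cpu|rocm[\d.]+)$", version)
    return variant.group(1) if variant else "default"

print(guess_variant("vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl"))       # cu129
print(guess_variant("vllm-0.11.2.dev278+gdbc3d9991-cp38-abi3-manylinux1_x86_64.whl"))  # default
```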
## Index Generation Details
The `generate-nightly-index.py` script performs the following:
1. **Parses wheel filenames** using regex to extract:
- Package name
- Version (with variant extracted)
- Python tag, ABI tag, platform tag
- Build tag (if present)
2. **Groups wheels by variant**, then by package name:
- Currently only `vllm` is built, but the structure supports multiple packages in the future.
3. **Generates HTML indices** (compliant with the [Simple repository API](https://packaging.python.org/en/latest/specifications/simple-repository-api/#simple-repository-api)):
- Top-level `index.html`: Lists all packages and variant subdirectories
- Package-level `index.html`: Lists all wheel files for that package
- Uses relative paths to wheel files for portability
4. **Generates metadata.json**:
- Machine-readable JSON containing all wheel metadata
- Includes `path` field with URL-encoded relative path to wheel file
- Used by `setup.py` to locate compatible pre-compiled wheels during Python-only builds
### Special Handling for AWS Services
The wheels and indices are directly stored on AWS S3, and we use AWS CloudFront as a CDN in front of the S3 bucket.
Since S3 does not provide proper directory listing, to support PyPI-compatible simple repository API behavior, we deploy a CloudFront Function that:
- redirects any URL that does not end with `/` and does not look like a file (i.e., does not contain a dot `.` in the last path segment) to the same URL with a trailing `/`
- appends `/index.html` to any URL that ends with `/`
For example, the following requests would be handled as:
- `/nightly` -> `/nightly/index.html`
- `/nightly/cu130/` -> `/nightly/cu130/index.html`
- `/nightly/index.html` or `/nightly/vllm.whl` -> unchanged
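A rough Python sketch of that routing rule (the real logic is a CloudFront Function written in JavaScript; this is only an illustration):
```python
# Illustration only: the actual logic is a CloudFront Function, not Python.
def route(uri: str) -> tuple[str, str]:
    """Return ("redirect" | "rewrite" | "pass", resulting URI)."""
    if uri.endswith("/"):
        return "rewrite", uri + "index.html"
    last_segment = uri.rsplit("/", 1)[-1]
    if "." not in last_segment:  # does not look like a file
        return "redirect", uri + "/"
    return "pass", uri

assert route("/nightly") == ("redirect", "/nightly/")  # then served as /nightly/index.html
assert route("/nightly/cu130/") == ("rewrite", "/nightly/cu130/index.html")
assert route("/nightly/vllm.whl") == ("pass", "/nightly/vllm.whl")
```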
!!! note "AWS S3 Filename Escaping"
S3 will automatically escape filenames upon upload according to its [naming rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html). The direct impact on vLLM is that `+` in filenames will be converted to `%2B`. We take special care in the index generation script to escape filenames properly when generating the HTML indices and JSON metadata, to ensure the URLs are correct and can be used directly.
## Usage of precompiled wheels in `setup.py` {#precompiled-wheels-usage}
When installing vLLM with `VLLM_USE_PRECOMPILED=1`, the `setup.py` script:
1. **Determines wheel location** via `precompiled_wheel_utils.determine_wheel_url()`:
- Env var `VLLM_PRECOMPILED_WHEEL_LOCATION` (user-specified URL/path) always takes precedence and skips all other steps.
- Determines the variant from `VLLM_MAIN_CUDA_VERSION` (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_VARIANT`); the default variant will also be tried as a fallback.
- Determines the _base commit_ (explained later) of this branch (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_COMMIT`).
2. **Fetches metadata** from `https://wheels.vllm.ai/{commit}/vllm/metadata.json` (for the default variant) or `https://wheels.vllm.ai/{commit}/{variant}/vllm/metadata.json` (for a specific variant).
3. **Selects compatible wheel** based on:
- Package name (`vllm`)
- Platform tag (architecture match)
4. **Downloads and extracts** precompiled binaries from the wheel:
- C++ extension modules (`.so` files)
- Flash Attention Python modules
- Triton kernel Python files
5. **Patches package_data** to include extracted files in the installation
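As a rough illustration of steps 1 and 2 above (this is not the `setup.py` code; the commit hash and variant are placeholders, and the layout of `metadata.json` is not asserted here):
```python
import json
import urllib.request

# Illustration only: a hypothetical base commit and variant.
commit = "0123456789abcdef0123456789abcdef01234567"
variant = "cu129"

url = f"https://wheels.vllm.ai/{commit}/{variant}/vllm/metadata.json"
with urllib.request.urlopen(url) as resp:
    metadata = json.load(resp)

# Inspect the metadata; setup.py would then select a wheel whose platform tag
# matches the local machine and download it via its relative `path` field.
print(json.dumps(metadata, indent=2)[:500])
```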
!!! note "What is the base commit?"
The base commit is determined by finding the merge-base
between the current branch and upstream `main`, ensuring
compatibility between source code and precompiled binaries.
_Note: it is the user's responsibility to ensure there are no native code (e.g., C++ or CUDA) changes before using precompiled wheels._
## Implementation Files
Key files involved in the nightly wheel mechanism:
- **`.buildkite/release-pipeline.yaml`**: CI pipeline that builds wheels
- **`.buildkite/scripts/upload-wheels.sh`**: Script that uploads wheels and generates indices
- **`.buildkite/scripts/generate-nightly-index.py`**: Python script that generates PyPI-compatible indices
- **`setup.py`**: Contains `precompiled_wheel_utils` class for fetching and using precompiled wheels
---
# Update PyTorch version on vLLM OSS CI/CD
vLLM's current policy is to always use the latest PyTorch stable
release in CI/CD. It is standard practice to submit a PR to update the
PyTorch version as early as possible when a new [PyTorch stable
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
This process is non-trivial due to the gap between PyTorch
releases. This document outlines the common steps to achieve this
update, along with a list of potential issues and how to address them.
## Test PyTorch release candidates (RCs)
Updating PyTorch in vLLM after the official release is not
ideal because any issues discovered at that point can only be resolved
by waiting for the next release or by implementing hacky workarounds in vLLM.
The better solution is to test vLLM with PyTorch release candidates (RC) to ensure
compatibility before each release.
PyTorch release candidates can be downloaded from [PyTorch test index](https://download.pytorch.org/whl/test).
For example, the `torch 2.7.0` RC built with CUDA 12.8 can be installed using the following command:
```bash
uv pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/test/cu128
```
When the final RC is ready for testing, it will be announced to the community
on the [PyTorch dev-discuss forum](https://dev-discuss.pytorch.org/c/release-announcements).
After this announcement, we can begin testing vLLM integration by drafting a pull request
following this 3-step process:
1. Update [requirements files](https://github.com/vllm-project/vllm/tree/main/requirements)
to point to the new releases for `torch`, `torchvision`, and `torchaudio`.
2. Use the following option to get the final release candidates' wheels. Some common platforms are `cpu`, `cu128`, and `rocm6.2.4`.
```bash
--extra-index-url https://download.pytorch.org/whl/test/
```
3. Since vLLM uses `uv`, ensure the following index strategy is applied:
- Via environment variable:
```bash
export UV_INDEX_STRATEGY=unsafe-best-match
```
- Or via CLI flag:
```bash
--index-strategy unsafe-best-match
```
If failures are found in the pull request, raise them as issues on vLLM and
cc the PyTorch release team to initiate discussion on how to address them.
## Update CUDA version
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, torch `2.7.1+cu126`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
such as 12.8 for Blackwell support.
This complicates the process as we cannot use the out-of-the-box
`pip install torch torchvision torchaudio` command. The solution is to use
`--extra-index-url` in vLLM's Dockerfiles.
- Important indexes at the moment include:
| Platform | `--extra-index-url` |
|----------|-----------------|
| CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
| CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
| ROCm 6.2 | [https://download.pytorch.org/whl/rocm6.2.4](https://download.pytorch.org/whl/rocm6.2.4) |
| ROCm 6.3 | [https://download.pytorch.org/whl/rocm6.3](https://download.pytorch.org/whl/rocm6.3) |
| XPU | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
- Update the files below to match the CUDA version from step 1. This ensures that the released vLLM wheel is tested on CI.
- `.buildkite/release-pipeline.yaml`
- `.buildkite/scripts/upload-wheels.sh`
## Manually running vLLM builds on Buildkite CI
When building vLLM with a new PyTorch/CUDA version, the vLLM sccache S3 bucket
will not have any cached artifacts, which can cause CI build jobs to exceed 5 hours.
Furthermore, vLLM's fastcheck pipeline operates in read-only mode and does not
populate the cache, making it ineffective for cache warm-up purposes.
To address this, manually trigger a build on Buildkite to accomplish two objectives:
1. Run the complete test suite against the PyTorch RC build by setting the environment variables: `RUN_ALL=1` and `NIGHTLY=1`
2. Populate the vLLM sccache S3 bucket with compiled artifacts, enabling faster subsequent builds
## Update all the different vLLM platforms
Rather than attempting to update all vLLM platforms in a single pull request, it's more manageable
to handle some platforms separately. The separation of requirements and Dockerfiles
for different platforms in vLLM CI/CD allows us to selectively choose
which platforms to update. For instance, updating XPU requires the corresponding
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
For example, the update to PyTorch 2.7.0 on CPU, CUDA, and ROCm and the follow-up update for XPU were completed in separate pull requests.
---
# Deprecation Policy
This document outlines the official policy and process for deprecating features
in the vLLM project.
## Overview
vLLM uses a structured "deprecation pipeline" to guide the lifecycle of
deprecated features. This policy ensures that users are given clear and
sufficient notice when a feature is deprecated and that deprecations proceed in
a consistent and predictable manner.
We aim to strike a balance between continued innovation and respecting users’
reliance on existing functionality. Deprecations are tied to our **minor (Y)
releases** following semantic versioning (X.Y.Z), where:
- **X** is a major version (rare)
- **Y** is a minor version (used for significant changes, including deprecations/removals)
- **Z** is a patch version (used for fixes and safer enhancements)
Features that fall under this policy include (at a minimum) the following:
- CLI flags
- Environment variables
- Configuration files
- APIs in the OpenAI-compatible API server
- Public Python APIs for the `vllm` library
## Deprecation Pipeline
The deprecation process consists of several clearly defined stages that span
multiple Y releases:
### 1. Deprecated (Still On By Default)
- **Action**: Feature is marked as deprecated.
- **Timeline**: A removal version is explicitly stated in the deprecation
warning (e.g., "This will be removed in v0.10.0").
- **Communication**: Deprecation is noted in the following, as applicable:
- Help strings
- Log output
- API responses
- `/metrics` output (for metrics features)
- User-facing documentation
- Release notes
- GitHub Issue (RFC) for feedback
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
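For Python APIs, a minimal sketch of such a deprecation marker (the function name and removal version below are illustrative):
```python
from typing_extensions import deprecated


# Illustrative only: the function name and removal version are made up.
@deprecated(
    "do_something_legacy() is deprecated and will be removed in v0.10.0; "
    "use do_something() instead."
)
def do_something_legacy() -> None:
    ...
```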
### 2. Deprecated (Off By Default)
- **Action**: Feature is disabled by default, but can still be re-enabled via a
CLI flag or environment variable. Feature throws an error when used without
re-enabling.
- **Purpose**: Allows users who missed earlier warnings a temporary escape hatch
while signaling imminent removal. Ensures any remaining usage is clearly
surfaced and blocks silent breakage before full removal.
### 3. Removed
- **Action**: Feature is completely removed from the codebase.
- **Note**: Only features that have passed through the previous deprecation
stages will be removed.
## Example Timeline
Assume a feature is deprecated in `v0.9.0`.
| Release | Status |
|---------------|-------------------------------------------------------------------------------------------------|
| `v0.9.0` | Feature is deprecated with clear removal version listed. |
| `v0.10.0` | Feature is now off by default, throws an error when used, and can be re-enabled for legacy use. |
| `v0.11.0` | Feature is removed. |
## Important Guidelines
- **No Removals in Patch Releases**: Removing deprecated features in patch
(`.Z`) releases is disallowed to avoid surprising users.
- **Grace Period for Existing Deprecations**: Any feature deprecated **before
this policy** will have its grace period start **now**, not retroactively.
- **Documentation is Critical**: Ensure every stage of the pipeline is
documented clearly for users.
## Final Notes
This policy is a living document and may evolve as the needs of the project and
its users change. Community feedback is welcome and encouraged as we refine the
process.
---
# Dockerfile
We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
- All build stages
- The default build target (highlighted in grey)
- External images (with dashed borders)
The edges of the build graph represent:
- `FROM ...` dependencies (with a solid line and a full arrow head)
- `COPY --from=...` dependencies (with a dashed line and an empty arrow head)
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
>
> *Build graph of the multi-stage Dockerfile (image not reproduced here).*
>
> Made using [dockerfilegraph](https://github.com/patrickhoefler/dockerfilegraph).
>
> Commands to regenerate the build graph (make sure to run it **from the root directory of the vLLM repository**, where the Dockerfile is present):
>
> ```bash
> dockerfilegraph \
> -o png \
> --legend \
> --dpi 200 \
> --max-label-length 50 \
> --filename docker/Dockerfile
> ```
>
> or in case you want to run it directly with the docker image:
>
> ```bash
> docker run \
> --rm \
> --user "$(id -u):$(id -g)" \
> --workdir /workspace \
> --volume "$(pwd)":/workspace \
> ghcr.io/patrickhoefler/dockerfilegraph:alpine \
> --output png \
> --dpi 200 \
> --max-label-length 50 \
> --filename docker/Dockerfile \
> --legend
> ```
>
> (To run it for a different file, you can pass in a different argument to the flag `--filename`.)
---
# Incremental Compilation Workflow
When working on vLLM's C++/CUDA kernels located in the `csrc/` directory, recompiling the entire project with `uv pip install -e .` for every change can be time-consuming. An incremental compilation workflow using CMake allows for faster iteration by only recompiling the necessary components after an initial setup. This guide details how to set up and use such a workflow, which complements your editable Python installation.
## Prerequisites
Before setting up the incremental build:
1. **vLLM Editable Install:** Ensure you have vLLM installed from source in an editable mode. Using pre-compiled wheels for the initial editable setup can be faster, as the CMake workflow will handle subsequent kernel recompilations.
```console
uv venv --python 3.12 --seed
source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -U -e . --torch-backend=auto
```
2. **CUDA Toolkit:** Verify that the NVIDIA CUDA Toolkit is correctly installed and `nvcc` is accessible in your `PATH`. CMake relies on `nvcc` to compile CUDA code. You can typically find `nvcc` in `$CUDA_HOME/bin/nvcc` or by running `which nvcc`. If you encounter issues, refer to the [official CUDA Toolkit installation guides](https://developer.nvidia.com/cuda-toolkit-archive) and vLLM's main [GPU installation documentation](../getting_started/installation/gpu.md#troubleshooting) for troubleshooting. The `CMAKE_CUDA_COMPILER` variable in your `CMakeUserPresets.json` should also point to your `nvcc` binary.
3. **Build Tools:** It is highly recommended to install `ccache` for fast rebuilds by caching compilation results (e.g., `sudo apt install ccache` or `conda install ccache`). Also, ensure the core build dependencies like `cmake` and `ninja` are installed. These are installable through `requirements/build.txt` or your system's package manager.
```console
uv pip install -r requirements/build.txt --torch-backend=auto
```
## Setting up the CMake Build Environment
The incremental build process is managed through CMake. You can configure your build settings using a `CMakeUserPresets.json` file at the root of the vLLM repository.
### Generate `CMakeUserPresets.json` using the helper script
To simplify the setup, vLLM provides a helper script that attempts to auto-detect your system's configuration (like CUDA path, Python environment, and CPU cores) and generates the `CMakeUserPresets.json` file for you.
**Run the script:**
Navigate to the root of your vLLM clone and execute the following command:
```console
python tools/generate_cmake_presets.py
```
The script will prompt you if it cannot automatically determine certain paths (e.g., `nvcc` or a specific Python executable for your vLLM development environment). Follow the on-screen prompts. If an existing `CMakeUserPresets.json` is found, the script will ask for confirmation before overwriting it.
**Force overwrite existing file:**
To automatically overwrite an existing `CMakeUserPresets.json` without prompting, use the `--force-overwrite` flag:
```console
python tools/generate_cmake_presets.py --force-overwrite
```
This is particularly useful in automated scripts or CI/CD environments where interactive prompts are not desired.
After running the script, a `CMakeUserPresets.json` file will be created in the root of your vLLM repository.
### Example `CMakeUserPresets.json`
Below is an example of what the generated `CMakeUserPresets.json` might look like. The script will tailor these values based on your system and any input you provide.
```json
{
"version": 6,
"cmakeMinimumRequired": {
"major": 3,
"minor": 26,
"patch": 1
},
"configurePresets": [
{
"name": "release",
"generator": "Ninja",
"binaryDir": "${sourceDir}/cmake-build-release",
"cacheVariables": {
"CMAKE_CUDA_COMPILER": "/usr/local/cuda/bin/nvcc",
"CMAKE_C_COMPILER_LAUNCHER": "ccache",
"CMAKE_CXX_COMPILER_LAUNCHER": "ccache",
"CMAKE_CUDA_COMPILER_LAUNCHER": "ccache",
"CMAKE_BUILD_TYPE": "Release",
"VLLM_PYTHON_EXECUTABLE": "/home/user/venvs/vllm/bin/python",
"CMAKE_INSTALL_PREFIX": "${sourceDir}",
"CMAKE_CUDA_FLAGS": "",
"NVCC_THREADS": "4",
"CMAKE_JOB_POOLS": "compile=32"
}
}
],
"buildPresets": [
{
"name": "release",
"configurePreset": "release",
"jobs": 32
}
]
}
```
**What do the various configurations mean?**
- `CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
- `CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
- `VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
- `CMAKE_INSTALL_PREFIX: "${sourceDir}"`: Specifies that the compiled components should be installed back into your vLLM source directory. This is crucial for the editable install, as it makes the newly built kernels immediately available to your Python environment.
- `CMAKE_JOB_POOLS` and `jobs` in build presets: Control the parallelism of the build. The script sets these based on the number of CPU cores detected on your system.
- `binaryDir`: Specifies where the build artifacts will be stored (e.g., `cmake-build-release`).
## Building and Installing with CMake
Once your `CMakeUserPresets.json` is configured:
1. **Initialize the CMake build environment:**
This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`
```console
cmake --preset release
```
2. **Build and install the vLLM components:**
This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
```console
cmake --build --preset release --target install
```
3. **Make changes and repeat!**
Now you can start using your editable install of vLLM, testing and making changes as needed. If you need to rebuild after further changes, simply run the CMake command again to build only the affected files.
```console
cmake --build --preset release --target install
```
## Verifying the Build
After a successful build, you will find a populated build directory (e.g., `cmake-build-release/` if you used the `release` preset and the example configuration).
```console
> ls cmake-build-release/
bin cmake_install.cmake _deps machete_generation.log
build.ninja CPackConfig.cmake detect_cuda_compute_capabilities.cu marlin_generation.log
_C.abi3.so CPackSourceConfig.cmake detect_cuda_version.cc _moe_C.abi3.so
CMakeCache.txt ctest _flashmla_C.abi3.so moe_marlin_generation.log
CMakeFiles cumem_allocator.abi3.so install_local_manifest.txt vllm-flash-attn
```
The `cmake --build ... --target install` command copies the compiled shared libraries (like `_C.abi3.so`, `_moe_C.abi3.so`, etc.) into the appropriate `vllm` package directory within your source tree. This updates your editable installation with the newly compiled kernels.
## Additional Tips
- **Adjust Parallelism:** Fine-tune the `CMAKE_JOB_POOLS` in `configurePresets` and `jobs` in `buildPresets` in your `CMakeUserPresets.json`. Too many jobs can overload systems with limited RAM or CPU cores, leading to slower builds or system instability. Too few won't fully utilize available resources.
- **Clean Builds When Necessary:** If you encounter persistent or strange build errors, especially after significant changes or switching branches, consider removing the CMake build directory (e.g., `rm -rf cmake-build-release`) and re-running the `cmake --preset` and `cmake --build` commands.
- **Specific Target Builds:** For even faster iterations when working on a specific module, you can sometimes build a specific target instead of the full `install` target, though `install` ensures all necessary components are updated in your Python environment. Refer to CMake documentation for more advanced target management.
---
# Basic Model
This guide walks you through the steps to implement a basic vLLM model.
## 1. Bring your model code
First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](../../../vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
!!! warning
Make sure to review and adhere to the original code's copyright and licensing terms!
## 2. Make your code compatible with vLLM
To ensure compatibility with vLLM, your model must meet the following requirements:
### Initialization Code
All vLLM modules within the model must include a `prefix` argument in their constructor. This `prefix` is typically the full name of the module in the model's state dictionary and is crucial for:
- Runtime support: vLLM's attention operators are registered in a model's state by their full names. Each attention operator must have a unique prefix as its layer name to avoid conflicts.
- Non-uniform quantization support: A quantized checkpoint can selectively quantize certain layers while keeping others in full precision. By providing the `prefix` during initialization, vLLM can match the current layer's `prefix` with the quantization configuration to determine if the layer should be initialized in quantized mode.
The initialization code should look like this:
??? code
```python
from torch import nn
from vllm.config import VllmConfig
from vllm.attention.layer import Attention
class MyAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")
class MyDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
class MyModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
)
class MyModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```
### Computation Code
- Add an `embed_input_ids` method inside the `MyModel` module that returns the text embeddings given `input_ids`. This is equivalent to directly calling the text embedding layer, but provides a unified interface in case `MyModel` is used within a composite multimodal model.
```python
class MyModel(nn.Module):
...
def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
...
```
- Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
```python
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: torch.Tensor | None = None,
) -> torch.Tensor:
...
```
!!! note
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out our [Llama implementation](../../../vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out [vllm/model_executor/models](../../../vllm/model_executor/models) for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them:
- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
- `RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
- `ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
- `MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
- `QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
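As a reference, here is a rough sketch (not taken from any specific vLLM model) of an MLP block built from these layers; the sizes and names are illustrative:
```python
from torch import nn
from vllm.model_executor.layers.linear import (
    ColumnParallelLinear,
    RowParallelLinear,
)


# A rough sketch: the first FFN layer is column-parallel and the second is
# row-parallel, mirroring the guidance above.
class MyParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, prefix: str = ""):
        super().__init__()
        # Output dimension is sharded across GPUs.
        self.up_proj = ColumnParallelLinear(
            hidden_size,
            intermediate_size,
            bias=False,
            prefix=f"{prefix}.up_proj",
        )
        self.act = nn.GELU()
        # Input dimension is sharded; an all-reduce combines the partial results.
        self.down_proj = RowParallelLinear(
            intermediate_size,
            hidden_size,
            bias=False,
            prefix=f"{prefix}.down_proj",
        )

    def forward(self, x):
        x, _ = self.up_proj(x)    # parallel linear layers return (output, bias)
        x = self.act(x)
        x, _ = self.down_proj(x)
        return x
```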
## 4. Implement the weight loading logic
You now need to implement the `load_weights` method in your `*ForCausalLM` class.
This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
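The exact mapping depends on your checkpoint layout, but a rough sketch of the common pattern (adapted loosely from existing vLLM models; the layer names are illustrative) looks like this:
```python
from collections.abc import Iterable

import torch
from torch import nn

from vllm.model_executor.model_loader.weight_utils import default_weight_loader


class MyModelForCausalLM(nn.Module):
    ...

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
        # Map separate checkpoint weights onto vLLM's merged layers:
        # (vLLM param name, checkpoint weight name, shard id)
        stacked_params_mapping = [
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
            ("gate_up_proj", "gate_proj", 0),
            ("gate_up_proj", "up_proj", 1),
        ]
        params_dict = dict(self.named_parameters())
        for name, loaded_weight in weights:
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in name:
                    continue
                # Load this shard into the corresponding merged parameter.
                param = params_dict[name.replace(weight_name, param_name)]
                param.weight_loader(param, loaded_weight, shard_id)
                break
            else:
                # Regular (non-merged) parameters.
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader", default_weight_loader)
                weight_loader(param, loaded_weight)
```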
## 5. Register your model
See [this page](registration.md) for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions
### How to support models with interleaving sliding windows?
To support a model with interleaving sliding windows, we need to take care of the following details:
- Make sure the model's `config.json` contains `layer_types`.
- In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).
With these two steps, interleaving sliding windows should work with the model.
### How to support models that use Mamba?
We consider 3 different scenarios:
1. Models that use Mamba layers (either Mamba-1 or Mamba-2) but do not use attention layers.
2. Models that combine Mamba layers (either Mamba-1 or Mamba-2) together with attention layers.
3. Models that combine Mamba-like mechanisms (e.g., Linear Attention, ShortConv) together with attention layers.
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](../../../vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](../../../vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
For the mamba layers themselves, please use the [`MambaMixer`](../../../vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](../../../vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in [vllm/model_executor/models/config.py](../../../vllm/model_executor/models/config.py) to ensure that the runtime defaults are optimized.
For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](../../../vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](../../../vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
These models should follow the same instructions as case (1), but they should inherit protocol `IsHybrid` (instead of `IsAttentionFree`) and it is *not* necessary to add them to the `MODELS_CONFIG_MAP` (their runtime defaults will be inferred from the protocol).
For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](../../../vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](../../../vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
Please follow the same guidelines as case (2) for implementing these models.
We use "mamba-like" to refer to layers that possess a state that is updated in-place, rather than being appended to (like the KV cache for attention).
For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype` and `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`.
It is also necessary to implement the "attention metadata" class which handles the metadata that is common across all layers.
Please see [`LinearAttentionMetadata`](../../../vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](../../../vllm/v1/attention/backends/short_conv_attn.py) for examples of this.
It is also worth noting that we should update `MAMBA_TYPE_TO_BACKEND_MAP` and `MambaAttentionBackendEnum` in [`registry.py`](../../../vllm/attention/backends/registry.py) when adding a new mamba backend.
Finally, if one wants to support torch.compile and CUDA graphs, it is necessary to wrap the call to the mamba-like layer inside a custom op and register it.
Please see the calls to `direct_register_custom_op` in [vllm/model_executor/models/minimax_text_01.py](../../../vllm/model_executor/models/minimax_text_01.py) or [vllm/model_executor/layers/mamba/short_conv.py](../../../vllm/model_executor/layers/mamba/short_conv.py) for examples of this.
The new custom op should then be added to the list `_attention_ops` in [vllm/config/compilation.py](../../../vllm/config/compilation.py) to ensure that piecewise CUDA graphs works as intended.
---
# Multi-Modal Support
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps](basic.md).
Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
??? code
```python
class YourModelForImage2Seq(nn.Module):
...
@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> str | None:
if modality.startswith("image"):
return "<image>"
raise ValueError("Only image modality is supported")
```
- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
```diff
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
```
More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it.
- Implement [embed_multimodal][vllm.model_executor.models.interfaces.SupportsMultiModal.embed_multimodal] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
??? code
```python
class YourModelForImage2Seq(nn.Module):
...
def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
assert self.vision_encoder is not None
image_features = self.vision_encoder(image_input)
return self.multi_modal_projector(image_features)
def embed_multimodal(
self,
**kwargs: object,
) -> MultiModalEmbeddings | None:
# Validate the multimodal input keyword arguments
image_input = self._parse_and_validate_image_input(**kwargs)
if image_input is None:
return None
# Run multimodal inputs through encoder and projector
vision_embeddings = self._process_image_input(image_input)
return vision_embeddings
```
!!! important
The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
!!! note
By default, vLLM merges the multimodal embeddings into text embeddings depending on the information of their locations defined in
[PlaceholderRange][vllm.multimodal.inputs.PlaceholderRange] from input processing.
This logic can be found at [embed_input_ids][vllm.model_executor.models.interfaces.SupportsMultiModal.embed_input_ids].
You may override this method if additional logic is required for your model when merging embeddings.
- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
```python
class YourModelForImage2Seq(nn.Module):
...
def get_language_model(self) -> torch.nn.Module:
# Change `language_model` according to your implementation.
return self.language_model
```
- Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
```diff
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
!!! note
The model class does not have to be named `*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
## 2. Specify processing information
Next, create a subclass of [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo]
to provide basic information related to HF processing.
### Maximum number of input items
You need to override the abstract method [get_supported_mm_limits][vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits]
to return the maximum number of input items for each modality supported by the model.
For example, if the model supports any number of images but only one video per prompt:
```python
def get_supported_mm_limits(self) -> Mapping[str, int | None]:
return {"image": None, "video": 1}
```
## 3. Specify dummy inputs
Then, inherit [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] to construct dummy inputs for
HF processing as well as memory profiling.
### For memory profiling
Override the abstract methods [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text] and [get_dummy_mm_data][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data] to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which equals the number of placeholder feature tokens.
=== "Basic example: LLaVA"
Looking at the code of HF's `LlavaForConditionalGeneration`:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]
if n_image_tokens != n_image_features:
raise ValueError(
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
)
special_image_mask = (
(input_ids == self.config.image_token_index)
.unsqueeze(-1)
.expand_as(inputs_embeds)
.to(inputs_embeds.device)
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
```
The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
selected_image_feature = selected_image_feature
else:
raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
image_features = self.multi_modal_projector(selected_image_feature)
return image_features
```
We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
(`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
mechanism doesn't change the sequence length of the output hidden states.
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
hidden_states = self.pre_layrnorm(hidden_states)
encoder_outputs = self.encoder(
inputs_embeds=hidden_states,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
```
To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
target_dtype = self.patch_embedding.weight.dtype
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
class_embeds = self.class_embedding.expand(batch_size, 1, -1)
embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
if interpolate_pos_encoding:
embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
else:
embeddings = embeddings + self.position_embedding(self.position_ids)
return embeddings
```
We can infer that `embeddings.shape[1] == self.num_positions`, where
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches + 1
```
Overall, the number of placeholder feature tokens for an image can be calculated as:
??? code
```python
def get_num_image_tokens(
self,
*,
image_width: int,
image_height: int,
) -> int:
hf_config = self.get_hf_config()
hf_processor = self.get_hf_processor()
image_size = hf_config.vision_config.image_size
patch_size = hf_config.vision_config.patch_size
num_image_tokens = (image_size // patch_size) ** 2 + 1
if hf_processor.vision_feature_select_strategy == "default":
num_image_tokens -= 1
return num_image_tokens
```
Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data:
??? code
```python
# NOTE: In actuality, this is usually implemented as part of the
# model's subclass of `BaseProcessingInfo`, but we show it as is
# here for simplicity.
def get_image_size_with_most_features(self) -> ImageSize:
hf_config = self.get_hf_config()
width = height = hf_config.image_size
return ImageSize(width=width, height=height)
def get_dummy_mm_data(
self,
seq_len: int,
mm_counts: Mapping[str, int],
mm_options: Mapping[str, BaseDummyOptions] | None = None,
) -> MultiModalDataDict:
num_images = mm_counts.get("image", 0)
target_width, target_height = \
self.info.get_image_size_with_most_features()
image_overrides = mm_options.get("image") if mm_options else None
return {
"image":
self._get_dummy_images(width=target_width,
height=target_height,
num_images=num_images,
overrides=image_overrides)
}
```
For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
```python
def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
num_images = mm_counts.get("image", 0)
processor = self.info.get_hf_processor()
image_token = processor.image_token
return image_token * num_images
```
=== "No input placeholders: Fuyu"
Looking at the code of HF's `FuyuForCausalLM`:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
if image_patches is not None and past_key_values is None:
patch_embeddings = [
self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
.squeeze(0)
.to(inputs_embeds.device)
for patch in image_patches
]
inputs_embeds = self.gather_continuous_embeddings(
word_embeddings=inputs_embeds,
continuous_embeddings=patch_embeddings,
image_patch_input_indices=image_patches_indices,
)
```
The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
`FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata.
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
batch_images = image_encoding["images"]
image_unpadded_heights = image_encoding["image_unpadded_heights"]
image_unpadded_widths = image_encoding["image_unpadded_widths"]
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
if do_resize:
batch_images = [
[self.resize(image, size=size, input_data_format=input_data_format) for image in images]
for images in batch_images
]
image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
if do_pad:
batch_images = [
[
self.pad_image(
image,
size=size,
mode=padding_mode,
constant_values=padding_value,
input_data_format=input_data_format,
)
for image in images
]
for images in batch_images
]
```
In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
image_input=tensor_batch_images,
image_present=image_present,
image_unpadded_h=image_unpadded_heights,
image_unpadded_w=image_unpadded_widths,
image_placeholder_id=image_placeholder_id,
image_newline_id=image_newline_id,
variable_sized=True,
)
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
image_height, image_width = image.shape[1], image.shape[2]
if variable_sized: # variable_sized=True
new_h = min(
image_height,
math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
)
new_w = min(
image_width,
math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
)
image = image[:, :new_h, :new_w]
image_height, image_width = new_h, new_w
num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
tensor_of_image_ids = torch.full(
[num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
)
patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
assert num_patches == patches.shape[0]
```
The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
patch_size = patch_size if patch_size is not None else self.patch_size
patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
if image_height % patch_height != 0:
raise ValueError(f"{image_height=} must be divisible by {patch_height}")
if image_width % patch_width != 0:
raise ValueError(f"{image_width=} must be divisible by {patch_width}")
num_patches_per_dim_h = image_height // patch_height
num_patches_per_dim_w = image_width // patch_width
num_patches = num_patches_per_dim_h * num_patches_per_dim_w
```
These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
```python
def get_image_size_with_most_features(self) -> ImageSize:
image_processor = self.get_image_processor()
return ImageSize(
width=image_processor.size["width"],
height=image_processor.size["height"],
)
```
Fuyu does not expect image placeholders in the inputs to HF processor, so
the dummy prompt text is empty regardless of the number of images.
```python
def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
return ""
```
For the multimodal image profiling data, the logic is very similar to LLaVA:
??? code
```python
def get_dummy_mm_data(
self,
seq_len: int,
mm_counts: Mapping[str, int],
mm_options: Optional[Mapping[str, BaseDummyOptions]] = None,
) -> MultiModalDataDict:
target_width, target_height = \
self.info.get_image_size_with_most_features()
num_images = mm_counts.get("image", 0)
image_overrides = mm_options.get("image") if mm_options else None
return {
"image":
self._get_dummy_images(
width=target_width,
height=target_height,
num_images=num_images,
overrides=image_overrides,
)
}
```
## 4. Specify processing details
Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]
to fill in the missing details about HF processing.
!!! info
[Multi-Modal Data Processing](../../design/mm_processing.md)
### Multi-modal fields
Override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] to
return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
=== "Basic example: LLaVA"
The output of `CLIPImageProcessor` is a simple tensor with shape
`(num_images, num_channels, image_height, image_width)`:
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
images = [
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
for image in all_images
]
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
```
So, we override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
```python
def _get_mm_fields_config(
self,
hf_inputs: BatchFeature,
hf_processor_mm_kwargs: Mapping[str, object],
) -> Mapping[str, MultiModalFieldConfig]:
return dict(
pixel_values=MultiModalFieldConfig.batched("image"),
)
```
!!! note
Our [actual code](../../../vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
=== "With postprocessing: Fuyu"
The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
the patches from each image belonging to an item in the batch:
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
image_input_ids.append(tensor_of_image_ids)
image_patches.append(patches)
else:
image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
batch_image_input_ids.append(image_input_ids)
batch_image_patches.append(image_patches)
```
The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
`(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
In order to support the use of
[MultiModalFieldConfig.batched][vllm.multimodal.inputs.MultiModalFieldConfig.batched]
like in LLaVA, we remove the extra batch dimension by overriding
[BaseMultiModalProcessor._call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor]:
??? code
```python
def _call_hf_processor(
self,
prompt: str,
mm_data: Mapping[str, object],
mm_kwargs: Mapping[str, object],
tok_kwargs: Mapping[str, object],
) -> BatchFeature:
processed_outputs = super()._call_hf_processor(
prompt=prompt,
mm_data=mm_data,
mm_kwargs=mm_kwargs,
tok_kwargs=tok_kwargs,
)
image_patches = processed_outputs.get("image_patches")
if image_patches is not None:
images = mm_data["images"]
assert isinstance(images, list)
# Original output: (1, num_images, Pn, Px * Py * C)
# New output: (num_images, Pn, Px * Py * C)
assert (isinstance(image_patches, list)
and len(image_patches) == 1)
assert (isinstance(image_patches[0], torch.Tensor)
and len(image_patches[0]) == len(images))
processed_outputs["image_patches"] = image_patches[0]
return processed_outputs
```
!!! note
Our [actual code](../../../vllm/model_executor/models/fuyu.py) has special handling
for text-only inputs to prevent unnecessary warnings from HF processor.
!!! note
The `_call_hf_processor` method specifies both `mm_kwargs` and `tok_kwargs` for
processing. `mm_kwargs` is used to both initialize and call the huggingface
processor, whereas `tok_kwargs` is only used to call the huggingface processor.
This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
```python
def _get_mm_fields_config(
self,
hf_inputs: BatchFeature,
hf_processor_mm_kwargs: Mapping[str, object],
) -> Mapping[str, MultiModalFieldConfig]:
return dict(image_patches=MultiModalFieldConfig.batched("image"))
```
### Prompt updates
Override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] to
return a list of [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instances.
Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies an update operation
(e.g.: insertion, replacement) performed by the HF processor.
=== "Basic example: LLaVA"
Looking at HF's `LlavaProcessor`:
```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
prompt_strings = []
for sample in text:
sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
prompt_strings.append(sample)
```
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
??? code
```python
def _get_prompt_updates(
self,
mm_items: MultiModalDataItems,
hf_processor_mm_kwargs: Mapping[str, object],
out_mm_kwargs: MultiModalKwargsItems,
) -> Sequence[PromptUpdate]:
hf_config = self.info.get_hf_config()
image_token_id = hf_config.image_token_index
def get_replacement(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
num_image_tokens = self.info.get_num_image_tokens(
image_width=image_size.width,
image_height=image_size.height,
)
return [image_token_id] * num_image_tokens
return [
PromptReplacement(
modality="image",
target=[image_token_id],
replacement=get_replacement,
),
]
```
=== "Handling additional tokens: Fuyu"
Recall the layout of feature tokens from Step 2:
```
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
...
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
```
We define a helper function to return `ncols` and `nrows` directly:
??? code
```python
def get_image_feature_grid_size(
self,
*,
image_width: int,
image_height: int,
) -> tuple[int, int]:
image_processor = self.get_image_processor()
target_width = image_processor.size["width"]
target_height = image_processor.size["height"]
patch_width = image_processor.patch_size["width"]
patch_height = image_processor.patch_size["height"]
if not (image_width <= target_width and image_height <= target_height):
height_scale_factor = target_height / image_height
width_scale_factor = target_width / image_width
optimal_scale_factor = min(height_scale_factor, width_scale_factor)
image_height = int(image_height * optimal_scale_factor)
image_width = int(image_width * optimal_scale_factor)
ncols = math.ceil(image_width / patch_width)
nrows = math.ceil(image_height / patch_height)
return ncols, nrows
```
Based on this, we can initially define our replacement tokens as:
??? code
```python
def get_replacement(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
# `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
# `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
```
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the prompt:
??? code
```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
image_input=tensor_batch_images,
image_present=image_present,
image_unpadded_h=image_unpadded_heights,
image_unpadded_w=image_unpadded_widths,
image_placeholder_id=image_placeholder_id,
image_newline_id=image_newline_id,
variable_sized=True,
)
prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
tokenizer=self.tokenizer,
prompts=prompts,
scale_factors=scale_factors,
max_tokens_to_generate=self.max_tokens_to_generate,
max_position_embeddings=self.max_position_embeddings,
add_BOS=True,
add_beginning_of_answer_token=True,
)
```
To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
??? code
```python
hf_config = self.info.get_hf_config()
bos_token_id = hf_config.bos_token_id  # <s>
assert isinstance(bos_token_id, int)
def get_replacement_fuyu(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
image_tokens = ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
return PromptUpdateDetails.select_token_id(
image_tokens + [bos_token_id],
embed_token_id=_IMAGE_TOKEN_ID,
)
```
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string:
??? code
```python
def _get_prompt_updates(
self,
mm_items: MultiModalDataItems,
hf_processor_mm_kwargs: Mapping[str, object],
out_mm_kwargs: MultiModalKwargsItems,
) -> Sequence[PromptUpdate]:
hf_config = self.info.get_hf_config()
bos_token_id = hf_config.bos_token_id
assert isinstance(bos_token_id, int)
tokenizer = self.info.get_tokenizer()
eot_token_id = tokenizer.bos_token_id
assert isinstance(eot_token_id, int)
def get_replacement_fuyu(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
image_tokens = ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
return PromptUpdateDetails.select_token_id(
image_tokens + [bos_token_id],
embed_token_id=_IMAGE_TOKEN_ID,
)
return [
PromptReplacement(
modality="image",
target=[eot_token_id],
replacement=get_replacement_fuyu,
)
]
```
## 5. Register processor-related classes
After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
to register them to the multi-modal registry:
```diff
from vllm.model_executor.models.interfaces import SupportsMultiModal
+ from vllm.multimodal import MULTIMODAL_REGISTRY
+ @MULTIMODAL_REGISTRY.register_processor(
+ YourMultiModalProcessor,
+ info=YourProcessingInfo,
+ dummy_inputs=YourDummyInputsBuilder,
+ )
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
## Notes
### Inserting feature tokens without replacement
Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use [PromptInsertion][vllm.multimodal.processing.PromptInsertion] instead of [PromptReplacement][vllm.multimodal.processing.PromptReplacement] inside [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
Examples:
- BLIP-2 (insert at start of prompt): [vllm/model_executor/models/blip2.py](../../../vllm/model_executor/models/blip2.py)
- Molmo (insert after `<|endoftext|>` token): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
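As a rough sketch of the BLIP-2-style pattern (assuming `PromptIndexTargets` for index-based targets; the token id and token-count helper below are placeholders):
```python
def _get_prompt_updates(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, object],
    out_mm_kwargs: MultiModalKwargsItems,
) -> Sequence[PromptUpdate]:
    hf_config = self.info.get_hf_config()
    image_token_id = hf_config.image_token_index  # placeholder; model-specific

    def get_insertion(item_idx: int):
        # Placeholder helper: number of feature tokens per image item.
        num_image_tokens = self.info.get_num_image_tokens()
        return [image_token_id] * num_image_tokens

    # Insert the feature tokens at the start of the prompt instead of
    # replacing an existing placeholder token.
    return [
        PromptInsertion(
            modality="image",
            target=PromptIndexTargets.start(),
            insertion=get_insertion,
        ),
    ]
```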
### Handling prompt updates unrelated to multi-modal data
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](../../design/mm_processing.md).
Examples:
- Chameleon (appends `sep_token`): [vllm/model_executor/models/chameleon.py](../../../vllm/model_executor/models/chameleon.py)
- Fuyu (appends `boa_token`): [vllm/model_executor/models/fuyu.py](../../../vllm/model_executor/models/fuyu.py)
- Molmo (applies chat template which is not defined elsewhere): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
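For instance, a Fuyu-style override that appends the `boa_token` could look roughly like this (a sketch; the token lookup is an assumption and the method signature may differ across vLLM versions):
```python
def _apply_hf_processor_tokens_only(
    self,
    prompt_tokens: list[int],
) -> list[int]:
    # The HF processor appends a beginning-of-answer token to text prompts,
    # so we mirror that for token inputs, which bypass the HF processor.
    tokenizer = self.info.get_tokenizer()
    boa_token_id = tokenizer.vocab["<0x04>"]  # assumption: Fuyu's boa token

    return prompt_tokens + [boa_token_id]
```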
### Custom HF processor
Some models don't define an HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to [_call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor].
Examples:
- DeepSeek-VL2: [vllm/model_executor/models/deepseek_vl2.py](../../../vllm/model_executor/models/deepseek_vl2.py)
- InternVL: [vllm/model_executor/models/internvl.py](../../../vllm/model_executor/models/internvl.py)
- Qwen-VL: [vllm/model_executor/models/qwen_vl.py](../../../vllm/model_executor/models/qwen_vl.py)
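A custom processor is simply a callable that mirrors the HF processor call signature and returns a `BatchFeature`. A minimal, self-contained sketch (all names here are illustrative, not an actual vLLM or HF class):
```python
from transformers import BatchFeature


class YourCustomProcessor:
    """Mimics the HF processor interface for a model without one on the Hub."""

    def __init__(self, config, tokenizer, image_processor) -> None:
        self.config = config
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs) -> BatchFeature:
        # Tokenize the text prompt.
        outputs = dict(self.tokenizer(text, return_tensors=return_tensors))

        # Preprocess the images (if any) and merge the resulting tensors.
        if images is not None:
            outputs.update(self.image_processor(images, return_tensors=return_tensors))

        return BatchFeature(outputs)
```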
---
# Registering a Model
vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here](../../models/supported_models.md).
If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
## Built-in models
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source](../../getting_started/installation/gpu.md#build-wheel-from-source).
This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](basic.md)), put it into the [vllm/model_executor/models](../../../vllm/model_executor/models) directory.
Then, add your model class to `_VLLM_MODELS` in [vllm/model_executor/models/registry.py](../../../vllm/model_executor/models/registry.py) so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important
The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models
You can load an external model [using a plugin](../../design/plugin_system.md) without modifying the vLLM codebase.
To register the model, use the following code:
```python
# The entrypoint of your plugin
def register():
from vllm import ModelRegistry
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
```
If your model imports modules that initialize CUDA, consider lazy-importing it to avoid errors like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
```python
# The entrypoint of your plugin
def register():
from vllm import ModelRegistry
ModelRegistry.register_model(
"YourModelForCausalLM",
"your_code:YourModelForCausalLM",
)
```
!!! important
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
Read more about that [here](multimodal.md).
---
# Unit Testing
This page explains how to write unit tests to verify the implementation of your model.
## Required Tests
These tests are necessary to get your PR merged into the vLLM library.
Without them, the CI for your PR will fail.
### Model loading
Include an example HuggingFace repository for your model in [tests/models/registry.py](../../../tests/models/registry.py).
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! important
The list of models in each section should be maintained in alphabetical order.
!!! tip
If your model requires a development version of HF Transformers, you can set
`min_transformers_version` to skip the test in CI until the model is released.
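As a hedged illustration of what such a registry entry might look like (the `_HfExamplesInfo` name and its fields are assumptions based on existing entries; check the file itself for the exact structure):
```python
# tests/models/registry.py (illustrative entry; names and fields are assumptions)
"YourModelForCausalLM": _HfExamplesInfo(
    "your-org/your-model",            # example HF repository
    trust_remote_code=True,           # only if the repo requires custom code
    min_transformers_version="4.55",  # skip in CI until this HF release is out
),
```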
## Optional Tests
These tests are optional for getting your PR merged into the vLLM library.
Passing these tests provides more confidence that your implementation is correct, and helps avoid future regressions.
### Model correctness
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of [tests/models](../../../tests/models).
#### Generative models
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in [tests/models/utils.py](../../../tests/models/utils.py):
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
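A typical logprobs-similarity test follows the rough pattern below. This is a sketch modeled on existing tests: the `hf_runner`/`vllm_runner` fixtures and the runner method names are taken from the test suite's conftest and may differ slightly in your version.
```python
import pytest

from ...utils import check_logprobs_close  # tests/models/utils.py


@pytest.mark.parametrize("model", ["your-org/your-model"])  # hypothetical repo
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("num_logprobs", [5])
def test_models(hf_runner, vllm_runner, example_prompts, model, max_tokens, num_logprobs):
    # Greedy generation with logprobs from both backends.
    with hf_runner(model) as hf_model:
        hf_outputs = hf_model.generate_greedy_logprobs_limit(
            example_prompts, max_tokens, num_logprobs)

    with vllm_runner(model) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy_logprobs(
            example_prompts, max_tokens, num_logprobs)

    # Each vLLM token should appear in HF's top-k logprobs, and vice versa.
    check_logprobs_close(
        outputs_0_lst=hf_outputs,
        outputs_1_lst=vllm_outputs,
        name_0="hf",
        name_1="vllm",
    )
```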
#### Pooling models
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
### Multi-modal processing
#### Common tests
Adding your model to [tests/models/multimodal/processing/test_common.py](../../../tests/models/multimodal/processing/test_common.py) verifies that the following input combinations result in the same outputs:
- Text + multi-modal data
- Tokens + multi-modal data
- Text + cached multi-modal data
- Tokens + cached multi-modal data
#### Model-specific tests
You can add a new file under [tests/models/multimodal/processing](../../../tests/models/multimodal/processing) to run tests that only apply to your model.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in [tests/models/multimodal/processing/test_phi3v.py](../../../tests/models/multimodal/processing/test_phi3v.py).
---
# Speech-to-Text (Transcription/Translation) Support
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.
## Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.
### `supported_languages` and `supports_transcription_only`
Declare supported languages and capabilities:
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g. Whisper).
??? code "supported_languages and supports_transcription_only"
```python
from typing import ClassVar, Mapping, Literal
import numpy as np
import torch
from torch import nn
from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType
from vllm.model_executor.models.interfaces import SupportsTranscription
class YourASRModel(nn.Module, SupportsTranscription):
# Map of ISO 639-1 language codes to language names
supported_languages: ClassVar[Mapping[str, str]] = {
"en": "English",
"it": "Italian",
# ... add more as needed
}
# If your model only supports audio-conditioned generation
# (no text-only generation), enable this flag.
supports_transcription_only: ClassVar[bool] = True
```
Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
This is for controlling general behavior of the API when serving your model:
??? code "get_speech_to_text_config()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_speech_to_text_config(
cls,
model_config: ModelConfig,
task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig:
return SpeechToTextConfig(
sample_rate=16_000,
max_audio_clip_s=30,
# Set to None to disable server-side chunking if your
# model/processor handles it already
min_energy_split_window_size=None,
)
```
See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.
Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
# Example with a free-form instruction prompt
task_word = "Transcribe" if task_type == "transcribe" else "Translate"
prompt = (
"user\n"
f"{task_word} this audio: "
"\nmodel\n"
)
return {
"multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
"prompt": prompt,
}
```
For further clarification on multi-modal inputs, please refer to [Multi-Modal Inputs](../../features/multimodal_inputs.md).
#### Encoder–decoder audio-only (e.g., Whisper)
Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
if language is None:
raise ValueError("Language must be specified")
prompt = {
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": (audio, stt_config.sample_rate),
},
},
"decoder_prompt": (
(f"<|prev|>{request_prompt}" if request_prompt else "")
+ f"<|startoftranscript|><|{language}|>"
+ f"<|{task_type}|><|notimestamps|>"
),
}
return cast(PromptType, prompt)
```
### `validate_language` (optional)
Language validation is handled via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language].
If your model requires a language and you want a default, override this method (see Whisper):
??? code "validate_language()"
```python
@classmethod
def validate_language(cls, language: str | None) -> str | None:
if language is None:
logger.warning(
"Defaulting to language='en'. If you wish to transcribe "
"audio in a different language, pass the `language` field "
"in the TranscriptionRequest."
)
language = "en"
return super().validate_language(language)
```
### `get_num_audio_tokens` (optional)
Token accounting for streaming is handled via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens].
Provide a fast duration→token estimate to improve streaming usage statistics:
??? code "get_num_audio_tokens()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_num_audio_tokens(
cls,
audio_duration_s: float,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
) -> int | None:
# Return None if unknown; otherwise return an estimate.
return int(audio_duration_s * stt_config.sample_rate // 320) # example
```
## Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
- Resampling: Input audio is resampled to `SpeechToTextConfig.sample_rate` using `librosa`.
- Chunking: If `SpeechToTextConfig.allow_audio_chunking` is True and the duration exceeds `max_audio_clip_s`, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by `overlap_chunk_second`.
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
Relevant server logic:
??? code "_preprocess_speech_to_text()"
```python
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
language = self.model_cls.validate_language(request.language)
...
y, sr = librosa.load(bytes_, sr=self.asr_config.sample_rate)
duration = librosa.get_duration(y=y, sr=sr)
do_split_audio = (self.asr_config.allow_audio_chunking
and duration > self.asr_config.max_audio_clip_s)
chunks = [y] if not do_split_audio else self._split_audio(y, int(sr))
prompts = []
for chunk in chunks:
prompt = self.model_cls.get_generation_prompt(
audio=chunk,
stt_config=self.asr_config,
model_config=self.model_config,
language=language,
task_type=self.task_type,
request_prompt=request.prompt,
to_language=to_language,
)
prompts.append(prompt)
return prompts, duration
```
## Exposing tasks automatically
vLLM automatically advertises transcription support if your model implements the interface:
```python
if supports_transcription(model):
if model.supports_transcription_only:
return ["transcription"]
supported_tasks.append("transcription")
```
When enabled, the server initializes the transcription and translation handlers:
```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
```
No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.
## Examples in-tree
- Whisper encoder–decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py). Make sure to have installed `mistral-common[audio]`.
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
Once your model implements `SupportsTranscription`, you can test the endpoints (API mimics OpenAI):
- Transcription (ASR):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio.wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/transcriptions
```
- Translation (source → English unless otherwise supported):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio.wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/translations
```
Or check out more examples in [examples/online_serving](../../../examples/online_serving).
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
- Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.
---
# Profiling vLLM
!!! warning
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
## Profile with PyTorch Profiler
We support tracing vLLM workers using the `torch.profiler` module. You can enable the torch profiler by passing `--profiler-config` when launching the server, setting the `profiler` entry to `'torch'` and `torch_profiler_dir` to the directory where you want to save the traces. You can further control what is profiled with the following additional entries in the config:
- `torch_profiler_record_shapes` to enable recording Tensor Shapes, off by default
- `torch_profiler_with_memory` to record memory, off by default
- `torch_profiler_with_stack` to enable recording stack information, on by default
- `torch_profiler_with_flops` to enable recording FLOPs, off by default
- `torch_profiler_use_gzip` to control gzip-compressing profiling files, on by default
- `torch_profiler_dump_cuda_time_total` to control dumping and printing the aggregated CUDA self time table, on by default
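For example, to also record tensor shapes and memory usage (entry names as listed above), the config could look like:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_record_shapes": true, "torch_profiler_with_memory": true}'
```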
When using `vllm bench serve`, you can enable profiling by passing the `--profile` flag.
Traces can be visualized using [Perfetto](https://ui.perfetto.dev/).
!!! tip
You can directly call the bench module without installing vLLM by running `python -m vllm.entrypoints.cli.main bench`.
!!! tip
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
!!! tip
Stopping the profiler flushes all the profile trace files to the output directory, which takes time: for roughly 100 requests' worth of data on a Llama 70B model, flushing takes about 10 minutes on an H100.
Set the environment variable `VLLM_RPC_TIMEOUT` to a large value (e.g. 30 minutes) before you start the server:
`export VLLM_RPC_TIMEOUT=1800000`
### Example commands and usage
#### Offline Inference
Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py) for an example.
#### OpenAI Server
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
```
Then run the `vllm bench serve` client with profiling enabled:
```bash
vllm bench serve \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path sharegpt.json \
--profile \
--num-prompts 2
```
Or use HTTP requests directly:
```shell
# First, call the /start_profile API to start profiling.
curl -X POST http://localhost:8000/start_profile
# Then send a generation request to the model.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "San Francisco is a"
}
]
}'
# Finally, call the /stop_profile API to stop profiling.
curl -X POST http://localhost:8000/stop_profile
```
## Profile with NVIDIA Nsight Systems
Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared-memory usage, annotated code regions, and low-level CUDA APIs and events.
[Install nsight-systems](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html) using your package manager.
The following block is an example for Ubuntu.
```bash
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
```
!!! tip
When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
The Nsight Systems profiler can be launched with `nsys profile ...`, with a few recommended flags for vLLM: `--trace-fork-before-exec=true --cuda-graph-trace=node`.
### Example commands and usage
#### Offline Inference
For basic usage, you can simply prepend the profiling command to any existing script you would run for offline inference.
The following is an example using the `vllm bench latency` script:
```bash
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
vllm bench latency \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-iters-warmup 5 \
--num-iters 1 \
--batch-size 16 \
--input-len 512 \
--output-len 8
```
#### OpenAI Server
To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, but you will need to specify a few other arguments to enable dynamic capture similarly to the Torch Profiler:
```bash
# server
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--capture-range=cudaProfilerApi \
--capture-range-end repeat \
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config.profiler cuda
# client
vllm bench serve \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path sharegpt.json \
--profile \
--num-prompts 2
```
With `--profile`, vLLM will capture a profile for each run of `vllm bench serve`. Once the server is killed, the profiles will all be saved.
#### Analysis
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
??? console "CLI example"
```bash
nsys stats report1.nsys-rep
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel, (bool)1>(T1 *, cons…
4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel(int)0&&vllm::_typeConvert::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
...
```
GUI example:
## Continuous Profiling
There is a [GitHub CI workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-profiling.yml) in the PyTorch infrastructure repository that provides continuous profiling for different models on vLLM. This automated profiling helps track performance characteristics over time and across different model configurations.
### How It Works
The workflow currently runs weekly profiling sessions for selected models, generating detailed performance traces that can be analyzed with different tools to identify performance regressions or optimization opportunities. It can also be triggered manually through the GitHub Actions interface.
### Adding New Models
To extend the continuous profiling to additional models, you can modify the [profiling-tests.json](https://github.com/pytorch/pytorch-integration-testing/blob/main/vllm-profiling/cuda/profiling-tests.json) configuration file in the PyTorch integration testing repository. Simply add your model specifications to this file to include them in the automated profiling runs.
### Viewing Profiling Results
The profiling traces generated by the continuous profiling workflow are publicly available on the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm). Look for the **Profiling traces** table to access and download the traces for different models and runs.
## Profiling vLLM Python Code
The Python standard library includes
[cProfile](https://docs.python.org/3/library/profile.html) for profiling Python
code. vLLM includes a couple of helpers that make it easy to apply it to a section of vLLM.
Both the `vllm.utils.profiling.cprofile` and `vllm.utils.profiling.cprofile_context` functions can be
used to profile a section of code.
!!! note
The legacy import paths `vllm.utils.cprofile` and `vllm.utils.cprofile_context` are deprecated.
Please use `vllm.utils.profiling.cprofile` and `vllm.utils.profiling.cprofile_context` instead.
### Example usage - decorator
The first helper is a Python decorator that can be used to profile a function.
If a filename is specified, the profile will be saved to that file. If no filename is
specified, profile data will be printed to stdout.
```python
from vllm.utils.profiling import cprofile
@cprofile("expensive_function.prof")
def expensive_function():
# some expensive code
pass
```
### Example Usage - context manager
The second helper is a context manager that can be used to profile a block of
code. Similar to the decorator, the filename is optional.
```python
from vllm.utils.profiling import cprofile_context
def another_function():
# more expensive code
pass
with cprofile_context("another_function.prof"):
another_function()
```
### Analyzing Profile Results
There are multiple tools available that can help analyze the profile results.
One example is [snakeviz](https://jiffyclub.github.io/snakeviz/).
```bash
pip install snakeviz
snakeviz expensive_function.prof
```
### Analyzing Garbage Collection Costs
Use the `VLLM_GC_DEBUG` environment variable to debug GC costs:
- `VLLM_GC_DEBUG=1`: enable the GC debugger and log `gc.collect` elapsed times.
- `VLLM_GC_DEBUG='{"top_objects":5}'`: enable the GC debugger and log the top 5 collected objects for each `gc.collect`.
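For example, to log the top collected objects while serving a model:
```bash
VLLM_GC_DEBUG='{"top_objects":5}' vllm serve Qwen/Qwen3-0.6B
```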
---
# Vulnerability Management
## Reporting Vulnerabilities
As mentioned in the [security
policy](https://github.com/vllm-project/vllm/tree/main/SECURITY.md), security
vulnerabilities may be reported privately to the project via
[GitHub](https://github.com/vllm-project/vllm/security/advisories/new).
## Vulnerability Management Team
Once a vulnerability has been reported to the project, the Vulnerability
Management Team (VMT) is responsible for managing the vulnerability. The VMT is
responsible for:
- Triaging the vulnerability.
- Coordinating with reporters and project maintainers on vulnerability analysis
and resolution.
- Drafting of security advisories for confirmed vulnerabilities, as appropriate.
- Coordination with project maintainers on a coordinated release of the fix and
security advisory.
### Security Advisories
Advisories are published via GitHub through the same system used to report
vulnerabilities. More information on the process can be found in the [GitHub
documentation](https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/about-repository-security-advisories).
### Team Members
We prefer to keep all vulnerability-related communication on the security report
on GitHub. However, if you need to contact the VMT directly for an urgent issue,
you may contact the following individuals:
- Simon Mo
- Russell Bryant
- Huzaifa Sidhpurwala
## Slack Discussion
You may use the `#security` channel in the [vLLM Slack](https://slack.vllm.ai)
to discuss security-related topics. However, please do not disclose any
vulnerabilities in this channel. If you need to report a vulnerability, please
use the GitHub security advisory system or contact a VMT member privately.
## Vulnerability Disclosure
The process for disclosing vulnerabilities is the following:
- The VMT will work with the project maintainers to develop a fix for the
vulnerability.
- The VMT will coordinate with the reporter and project maintainers to prepare a
security advisory that adequately describes the vulnerability and its impact.
- The VMT will coordinate with the project maintainers to publish a fix and
release an update that includes that fix.
- The VMT will publish the security advisory on GitHub. Release notes will be
updated to include a reference to the security advisory.
The VMT and project maintainers will work to minimize the amount of time in
between disclosing any public information about the vulnerability and making a
release and advisory available.
---
# Using Docker
## Use vLLM's Official Docker Image
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
```
This image can also be used with other container engines such as [Podman](https://podman.io/).
```bash
podman run --device nvidia.com/gpu=all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
```
You can add any other [engine-args](../configuration/engine_args.md) you need after the image tag (`vllm/vllm-openai:latest`).
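For example, to cap the context length with the `--max-model-len` engine argument:
```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --max-model-len 8192
```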
!!! note
You can either use the `--ipc=host` flag or the `--shm-size` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.
!!! note
Optional dependencies are not included in order to avoid licensing issues.
If you need to use those dependencies (having accepted the license terms),
create a custom Dockerfile on top of the base image with an extra layer that installs them:
```Dockerfile
FROM vllm/vllm-openai:v0.11.0
# e.g. install the `audio` optional dependencies
# NOTE: Make sure the version of vLLM matches the base image!
RUN uv pip install --system vllm[audio]==0.11.0
```
!!! tip
Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).
To use the development version of `transformers`, create a custom Dockerfile on top of the base image
with an extra layer that installs their code from source:
```Dockerfile
FROM vllm/vllm-openai:latest
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
```
## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided [docker/Dockerfile](../../docker/Dockerfile). To build vLLM:
```bash
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile
```
!!! note
By default, vLLM builds for all GPU types for the widest distribution. If you are only building for the
GPU type of the machine you are building on, you can add the argument `--build-arg torch_cuda_arch_list=""`
so that vLLM detects the current GPU type and builds for it.
If you are using Podman instead of Docker, you might need to disable SELinux labeling by
adding `--security-opt label=disable` to the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
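For instance, the two notes above translate into builds like the following (a sketch; the `local-arch` tag is just an illustrative name):
```bash
# Build only for the GPU architecture of the machine you are building on
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai:local-arch \
    --file docker/Dockerfile \
    --build-arg torch_cuda_arch_list=""

# The same build with Podman, disabling SELinux labeling
podman build . \
    --security-opt label=disable \
    --target vllm-openai \
    --tag vllm/vllm-openai:local-arch \
    --file docker/Dockerfile \
    --build-arg torch_cuda_arch_list=""
```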
!!! note
If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time.
* **Enable the feature** by adding the build argument: `--build-arg VLLM_USE_PRECOMPILED="1"`.
* **How it works**: By default, vLLM automatically finds the correct wheels from our [Nightly Builds](../contributing/ci/nightly_builds.md) by using the merge-base commit with the upstream `main` branch.
* **Override commit**: To use wheels from a specific commit, provide the `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=` argument.
For a detailed explanation, refer to the 'Set up using Python-only build (without compilation)' section in [Build wheel from source](../contributing/ci/nightly_builds.md#precompiled-wheels-usage); these arguments behave similarly.
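As a concrete sketch of the flags described above (the image tag is arbitrary and the commit value is a placeholder):
```bash
# Reuse precompiled wheels to skip kernel compilation
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai:precompiled \
    --file docker/Dockerfile \
    --build-arg VLLM_USE_PRECOMPILED="1"

# Optionally pin the wheel source to a specific commit:
#   --build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit-sha>
```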
## Building for Arm64/aarch64
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag `--platform "linux/arm64"` will build for arm64.
!!! note
Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` and `--build-arg nvcc_threads=`
flags to speed up the build. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
Keep an eye on memory usage with parallel jobs, as it can be substantial (see the example below).
??? console "Command"
```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
--build-arg RUN_WHEEL_CHECK=false
```
For (G)B300, we recommend using CUDA 13, as shown in the following command.
??? console "Command"
```bash
DOCKER_BUILDKIT=1 docker build \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
--build-arg max_jobs=256 \
--build-arg nvcc_threads=2 \
--build-arg RUN_WHEEL_CHECK=false \
--build-arg torch_cuda_arch_list='9.0 10.0+PTX' \
--platform "linux/arm64" \
--tag vllm/vllm-gb300-openai:latest \
--target vllm-openai \
-f docker/Dockerfile \
.
```
!!! note
If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.
Run the following command on your host machine to register QEMU user static handlers:
```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```
After setting up QEMU, you can use the `--platform "linux/arm64"` flag in your `docker build` command.
## Use the custom-built vLLM Docker image
To run vLLM with the custom-built Docker image:
```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=" \
vllm/vllm-openai
```
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
!!! note
**For versions 0.4.1 and 0.4.2 only** - the vLLM Docker images for these versions are supposed to run as the root user, because a library under the root user's home directory (`/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`) must be loaded at runtime. If you run the container as a different user, you may need to first change the permissions of the library (and all of its parent directories) so that the user can access it, and then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
---
# Anyscale
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like [examples/online_serving/run_cluster.sh](../../../examples/online_serving/run_cluster.sh).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).
## Production-ready vLLM on Anyscale quickstarts
- [Offline batch inference](https://console.anyscale.com/template-preview/llm_batch_inference?utm_source=vllm_docs)
- [Deploy vLLM services](https://console.anyscale.com/template-preview/llm_serving?utm_source=vllm_docs)
- [Curate a dataset](https://console.anyscale.com/template-preview/audio-dataset-curation-llm-judge?utm_source=vllm_docs)
- [Finetune an LLM](https://console.anyscale.com/template-preview/entity-recognition-with-llms?utm_source=vllm_docs)
---
# AnythingLLM
[AnythingLLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
## Prerequisites
Set up the vLLM environment:
```bash
pip install vllm
```
## Deploy
1. Start the vLLM server with a supported chat-completion model, for example:
```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
```
1. Download and install [AnythingLLM Desktop](https://anythingllm.com/desktop).
1. Configure the AI provider:
- At the bottom, click the 🔧 wrench icon -> **Open settings** -> **AI Providers** -> **LLM**.
- Enter the following values:
- LLM Provider: Generic OpenAI
- Base URL: `http://{vllm server host}:{vllm server port}/v1`
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`

1. Create a workspace:
1. At the bottom, click the ↺ back icon to return to the workspaces view.
1. Create a workspace (e.g., `vllm`) and start chatting.

1. Add a document.
1. Click the 📎 attachment icon.
1. Upload a document.
1. Select and move the document into your workspace.
1. Save and embed it.

1. Chat using your document as context.

---
# AutoGen
[AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.
## Prerequisites
Set up the vLLM and [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment:
```bash
pip install vllm
# Install AgentChat and OpenAI client from Extensions
# AutoGen requires Python 3.10 or later.
pip install -U "autogen-agentchat" "autogen-ext[openai]"
```
## Deploy
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2
```
1. Call it with AutoGen:
??? code
```python
import asyncio
from autogen_core.models import UserMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_core.models import ModelFamily
async def main() -> None:
# Create a model client
model_client = OpenAIChatCompletionClient(
model="mistralai/Mistral-7B-Instruct-v0.2",
base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
api_key="EMPTY",
model_info={
"vision": False,
"function_calling": False,
"json_output": False,
"family": ModelFamily.MISTRAL,
"structured_output": True,
},
)
messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
# Create a stream.
stream = model_client.create_stream(messages=messages)
# Iterate over the stream and print the responses.
print("Streamed responses:")
async for response in stream:
if isinstance(response, str):
# A partial response is a string.
print(response, flush=True, end="")
else:
# The last response is a CreateResult object with the complete message.
print("\n\n------------\n")
print("The complete response:", flush=True)
print(response.content, flush=True)
# Close the client when done.
await model_client.close()
asyncio.run(main())
```
For details, see the tutorial:
- [Using vLLM in AutoGen](https://microsoft.github.io/autogen/0.2/docs/topics/non-openai-models/local-vllm/)
- [OpenAI-compatible API examples](https://microsoft.github.io/autogen/stable/reference/python/autogen_ext.models.openai.html#autogen_ext.models.openai.OpenAIChatCompletionClient)
---
# BentoML
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
For details, see the tutorial [vLLM inference in the BentoML documentation](https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html).
---
# Cerebrium
vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
To install the Cerebrium client, run:
```bash
pip install cerebrium
cerebrium login
```
Next, to create your Cerebrium project, run:
```bash
cerebrium init vllm-project
```
Next, to install the required packages, add the following to your `cerebrium.toml`:
```toml
[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
[cerebrium.dependencies.pip]
vllm = "latest"
```
Next, to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example), add the following code to your `main.py`:
??? code
```python
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
results = []
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results}
```
Then, run the following code to deploy it to the cloud:
```bash
cerebrium deploy
```
If successful, you will be returned a curl command that you can use to call your inference endpoint. Just remember to end the URL with the name of the function you are calling (in our case, `/run`).
??? console "Command"
```bash
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \
-H 'Authorization: ' \
--data '{
"prompts": [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is"
]
}'
```
You should get a response like:
??? console "Response"
```json
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"result": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
```
You now have an autoscaling endpoint where you only pay for the compute you use!
---
# Chatbox
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
## Prerequisites
Set up the vLLM environment:
```bash
pip install vllm
```
## Deploy
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```
1. Download and install [Chatbox desktop](https://chatboxai.app/en#download).
1. On the bottom left of the settings, add a custom provider with the following values:
- API Mode: `OpenAI API Compatible`
- Name: vllm
- API Host: `http://{vllm server host}:{vllm server port}/v1`
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`

1. Go to `Just chat`, and start to chat:

---
# Dify
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
It supports vLLM as a model provider to efficiently serve large language models.
This guide walks you through deploying Dify using a vLLM backend.
## Prerequisites
Set up the vLLM environment:
```bash
pip install vllm
```
And install [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/).
## Deploy
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
vllm serve Qwen/Qwen1.5-7B-Chat
```
1. Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):
```bash
git clone https://github.com/langgenius/dify.git
cd dify
cd docker
cp .env.example .env
docker compose up -d
```
1. Open `http://localhost/install` in your browser, configure the basic login information, and log in.
1. In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it.
1. Fill in the model provider details as follows:
- **Model Type**: `LLM`
- **Model Name**: `Qwen/Qwen1.5-7B-Chat`
- **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`

1. To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:

1. Click the chatbot you just created to open the chat interface and start interacting with the model:

---
# dstack
vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.
To install the dstack client and start the dstack server, run:
```bash
pip install "dstack[all]"
dstack server
```
Next, to configure your dstack project, run:
```bash
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
```
Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
??? code "Config"
```yaml
type: service
python: "3.11"
env:
- MODEL=NousResearch/Llama-2-7b-chat-hf
port: 8000
resources:
gpu: 24GB
commands:
- pip install vllm
- vllm serve $MODEL --port 8000
model:
format: openai
type: chat
name: NousResearch/Llama-2-7b-chat-hf
```
Then, run the following CLI for provisioning:
??? console "Command"
```console
$ dstack run . -f serve.dstack.yml
⠸ Getting run plan...
Configuration serve.dstack.yml
Project deep-diver-main
User deep-diver
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Max price -
Max duration -
Spot policy auto
Retry policy no
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Shown 3 of 193 offers, $5.876 max
Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...
```
After the provisioning, you can interact with the model by using the OpenAI SDK:
??? code
```python
from openai import OpenAI
client = OpenAI(
base_url="https://gateway.",
api_key="",
)
completion = client.chat.completions.create(
model="NousResearch/Llama-2-7b-chat-hf",
messages=[
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming.",
}
],
)
print(completion.choices[0].message.content)
```
!!! note
dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
---
# Haystack
[Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
## Prerequisites
Set up the vLLM and Haystack environment:
```bash
pip install vllm haystack-ai
```
## Deploy
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.1
```
1. Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
??? code
```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
generator = OpenAIChatGenerator(
# for compatibility with the OpenAI API, a placeholder api_key is needed
api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
model="mistralai/Mistral-7B-Instruct-v0.1",
api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
generation_kwargs={"max_tokens": 512},
)
response = generator.run(
messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
)
print("-"*30)
print(response)
print("-"*30)
```
```console
------------------------------
{'replies': [ChatMessage(_role=, _content=[TextContent(text=' Of course! Where in Italy would you like to go and what type of trip are you looking to plan?')], _name=None, _meta={'model': 'mistralai/Mistral-7B-Instruct-v0.1', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 23, 'prompt_tokens': 21, 'total_tokens': 44, 'completion_tokens_details': None, 'prompt_tokens_details': None}})]}
------------------------------
```
For details, see the tutorial [Using vLLM in Haystack](https://github.com/deepset-ai/haystack-integrations/blob/main/integrations/vllm.md).
---
# Helm
A Helm chart to deploy vLLM for Kubernetes
Helm is a package manager for Kubernetes. It helps automate the deployment of vLLM applications on Kubernetes. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for Helm installation and documentation on architecture and values file.
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- (Optional) An S3 bucket or other storage with the model weights, if using automatic model download
## Installing the chart
To install the chart with the release name `test-vllm`:
```bash
helm upgrade --install --create-namespace \
--namespace=ns-vllm test-vllm . \
-f values.yaml \
--set secrets.s3endpoint=$ACCESS_POINT \
--set secrets.s3bucketname=$BUCKET \
--set secrets.s3accesskeyid=$ACCESS_KEY \
--set secrets.s3accesskey=$SECRET_KEY
```
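After installation, you can confirm that the release and its pods are up (generic checks, not specific to this chart):
```bash
helm status test-vllm --namespace ns-vllm
kubectl get pods --namespace ns-vllm
```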
## Uninstalling the chart
To uninstall the `test-vllm` deployment:
```bash
helm uninstall test-vllm --namespace=ns-vllm
```
The command removes all the Kubernetes components associated with the
chart **including persistent volumes** and deletes the release.
## Architecture

## Values
The following table describes configurable parameters of the chart in `values.yaml`:
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
| autoscaling.enabled | bool | false | Enable autoscaling |
| autoscaling.maxReplicas | int | 100 | Maximum replicas |
| autoscaling.minReplicas | int | 1 | Minimum replicas |
| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
| configs | object | {} | Configmap |
| containerPort | int | 8000 | Container port |
| customObjects | list | [] | Custom Objects configuration |
| deploymentStrategy | object | {} | Deployment strategy configuration |
| externalConfigs | list | [] | External configuration |
| extraContainers | list | [] | Additional containers configuration |
| extraInit | object | {"modelDownload":{"enabled":true},"initContainers":[],"pvcStorage":"1Gi"} | Additional configuration for init containers |
| extraInit.modelDownload | object | {"enabled":true} | Model download functionality configuration |
| extraInit.modelDownload.enabled | bool | true | Enable automatic model download job and wait container |
| extraInit.modelDownload.image | object | {"repository":"amazon/aws-cli","tag":"2.6.4","pullPolicy":"IfNotPresent"} | Image for model download operations |
| extraInit.modelDownload.waitContainer | object | {} | Wait container configuration (command, args, env) |
| extraInit.modelDownload.downloadJob | object | {} | Download job configuration (command, args, env) |
| extraInit.initContainers | list | [] | Custom init containers (appended after model download if enabled) |
| extraInit.pvcStorage | string | "1Gi" | Storage size for the PVC |
| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | (Optional) Path of the model on S3 |
| extraInit.awsEc2MetadataDisabled | bool | true | (Optional) Disable AWS EC2 metadata service |
| extraPorts | list | [] | Additional ports configuration |
| gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used |
| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
| image.repository | string | "vllm/vllm-openai" | Image repository |
| image.tag | string | "latest" | Image tag |
| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
| livenessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive |
| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet http request on the server |
| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated |
| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration |
| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
| readinessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready |
| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet http request on the server |
| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated |
| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
| replicaCount | int | 1 | Number of replicas |
| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
| resources.limits."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.limits.cpu | int | 4 | Number of CPUs |
| resources.limits.memory | string | "16Gi" | CPU memory configuration |
| resources.requests."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.requests.cpu | int | 4 | Number of CPUs |
| resources.requests.memory | string | "16Gi" | CPU memory configuration |
| secrets | object | {} | Secrets configuration |
| serviceName | string | "" | Service name |
| servicePort | int | 80 | Service port |
| labels.environment | string | test | Environment name |
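Any of these values can also be overridden from the command line at install or upgrade time. For example (a sketch using keys from the table above; adjust the values to your environment):
```bash
helm upgrade --install --create-namespace \
    --namespace=ns-vllm test-vllm . \
    -f values.yaml \
    --set replicaCount=2 \
    --set image.tag=latest \
    --set servicePort=80
```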
## Configuration Examples
### Using S3 Model Download (Default)
```yaml
extraInit:
modelDownload:
enabled: true
pvcStorage: "10Gi"
s3modelpath: "models/llama-7b"
```
### Using Custom Init Containers Only
For use cases like llm-d where you need custom sidecars without model download:
```yaml
extraInit:
modelDownload:
enabled: false
initContainers:
- name: llm-d-routing-proxy
image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0
imagePullPolicy: IfNotPresent
ports:
- containerPort: 8080
name: proxy
securityContext:
runAsUser: 1000
restartPolicy: Always
pvcStorage: "10Gi"
```
---
# Hugging Face Inference Endpoints
## Overview
Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, either starting from the [Hugging Face Hub](https://huggingface.co) or directly from the [Inference Endpoints](https://endpoints.huggingface.co/) interface. This allows you to serve models in a fully managed environment with GPU acceleration, auto-scaling, and monitoring, without managing the infrastructure manually.
For advanced details on vLLM integration and deployment options, see [Advanced Deployment Details](#advanced-deployment-details).
## Deployment Methods
- [**Method 1: Deploy from the Catalog.**](#method-1-deploy-from-the-catalog) One-click deploy models from the Hugging Face Hub with ready-made optimized configurations.
- [**Method 2: Guided Deployment (Transformers Models).**](#method-2-guided-deployment-transformers-models) Instantly deploy models tagged with `transformers` from the Hub UI using the **Deploy** button.
- [**Method 3: Manual Deployment (Advanced Models).**](#method-3-manual-deployment-advanced-models) For models that either use custom code with the `transformers` tag, or don’t run with standard `transformers` but are supported by vLLM. This method requires manual configuration.
### Method 1: Deploy from the Catalog
This is the easiest way to get started with vLLM on Hugging Face Inference Endpoints. You can browse a catalog of models with verified and optimized deployment configuration at [Inference Endpoints](https://endpoints.huggingface.co/catalog) to maximize performance.
1. Go to the [Endpoints Catalog](https://endpoints.huggingface.co/catalog) and, in the **Inference Server** options, select `vLLM`. This will display the current list of models with optimized, preconfigured options.

1. Select the desired model and click **Create Endpoint**.

1. Once the deployment is ready, you can use the endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console, remembering to append `/v1` as required.
```python
# pip install openai
from openai import OpenAI
import os
client = OpenAI(
base_url=DEPLOYMENT_URL,
api_key=os.environ["HF_TOKEN"], # https://huggingface.co/settings/tokens
)
chat_completion = client.chat.completions.create(
model="HuggingFaceTB/SmolLM3-3B",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Give me a brief explanation of gravity in simple terms.",
}
],
}
],
stream=True,
)
for message in chat_completion:
print(message.choices[0].delta.content, end="")
```
!!! note
The catalog provides models optimized for vLLM, including GPU settings and inference engine configurations. You can monitor the endpoint and update the **container or its configuration** from the Inference Endpoints UI.
### Method 2: Guided Deployment (Transformers Models)
This method applies to models with the [`transformers` library tag](https://huggingface.co/models?library=transformers) in their metadata. It allows you to deploy a model directly from the Hub UI without manual configuration.
1. Navigate to a model on [Hugging Face Hub](https://huggingface.co/models).
For this example we will use the [`ibm-granite/granite-docling-258M`](https://huggingface.co/ibm-granite/granite-docling-258M) model. You can verify that the model is compatible by checking the front matter in the [README](https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/README.md), where the library is tagged as `library: transformers`.
2. Locate the **Deploy** button. The button appears for models tagged with `transformers` at the top right of the [model card](https://huggingface.co/ibm-granite/granite-docling-258M).

3. Click the **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.

4. Select the hardware (we choose AWS > GPU > T4 for this example) and the container configuration. Choose `vLLM` as the container type and finalize the deployment by pressing **Create Endpoint**.

5. Use the deployed endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console (remember to append `/v1` as needed). You can then use your endpoint programmatically or via the SDK.
```python
# pip install openai
from openai import OpenAI
import os
client = OpenAI(
base_url=DEPLOYMENT_URL,
api_key=os.environ["HF_TOKEN"], # https://huggingface.co/settings/tokens
)
chat_completion = client.chat.completions.create(
model="ibm-granite/granite-docling-258M",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png",
},
},
{
"type": "text",
"text": "Convert this page to docling.",
},
]
}
],
stream=True,
)
for message in chat_completion:
print(message.choices[0].delta.content, end="")
```
!!! note
This method uses best-guess defaults. You may need to adjust the configuration to fit your specific requirements.
### Method 3: Manual Deployment (Advanced Models)
Some models require manual deployment because they:
- Use custom code with the `transformers` tag
- Don't run with standard `transformers` but are supported by `vLLM`
These models cannot be deployed using the **Deploy** button on the model card.
In this guide, we demonstrate manual deployment using the [`rednote-hilab/dots.ocr`](https://huggingface.co/rednote-hilab/dots.ocr) model, an OCR model integrated with vLLM (see vLLM [PR](https://github.com/vllm-project/vllm/pull/24645)).
1. Start a new deployment. Go to [Inference Endpoints](https://endpoints.huggingface.co/) and click `New`.

2. Search for the model on the Hub. In the dialog, switch to **Hub** and search for the desired model.

3. Choose the infrastructure. On the configuration page, select the cloud provider and hardware from the available options.
For this demo, we choose AWS and L4 GPU. Adjust according to your hardware needs.

4. Configure the container. Scroll to the **Container Configuration** and select `vLLM` as the container type.

5. Create the endpoint. Click **Create Endpoint** to deploy the model.
Once the endpoint is ready, you can use it with the OpenAI Completion API, cURL, or other SDKs. Remember to append `/v1` to the deployment URL if needed.
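For example, a minimal curl request against the OpenAI-compatible chat completions route might look like the following (a sketch; `DEPLOYMENT_URL` already includes the `/v1` suffix, and the model name should match your deployment):
```bash
curl "$DEPLOYMENT_URL/chat/completions" \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "rednote-hilab/dots.ocr",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```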
!!! note
You can adjust the **container settings** (Container URI, Container Arguments) from the Inference Endpoints UI and press **Update Endpoint**. This redeploys the endpoint with the updated container configuration. Changes to the model itself require creating a new endpoint or redeploying with a different model. For example, for this demo, you may need to update the Container URI to the nightly image (`vllm/vllm-openai:nightly`) and add the `--trust-remote-code` flag in the container arguments.
## Advanced Deployment Details
With the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.
Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts.
The platform integrates seamlessly with the Hugging Face Hub, allowing you to deploy any vLLM- or `transformers`-compatible model, track usage, and update the inference engine directly. The vLLM engine comes preconfigured, enabling optimized inference and easy switching between models or engines without modifying your code. This setup simplifies production deployment: endpoints are ready in minutes, include monitoring and logging, and let you focus on serving models rather than maintaining infrastructure.
## Next Steps
- Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog
- Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index)
- Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm)
- Understand the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
---
# LiteLLM
[LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format (Bedrock, Hugging Face, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.).
LiteLLM manages:
- Translating inputs to the provider's `completion`, `embedding`, and `image_generation` endpoints
- [Consistent output](https://docs.litellm.ai/docs/completion/output): text responses are always available at `['choices'][0]['message']['content']`
- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - [Router](https://docs.litellm.ai/docs/routing)
- Setting budgets and rate limits per project, API key, and model - [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/simple_proxy)
LiteLLM supports all models served by vLLM.
## Prerequisites
Set up the vLLM and litellm environment:
```bash
pip install vllm litellm
```
## Deploy
### Chat completion
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```
1. Call it with litellm:
??? code
```python
import litellm
messages = [{"content": "Hello, how are you?", "role": "user"}]
# the "hosted_vllm/" prefix is required so LiteLLM routes the request to the vLLM server
response = litellm.completion(
model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
messages=messages,
api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
temperature=0.2,
max_tokens=80,
)
print(response)
```
### Embeddings
1. Start the vLLM server with the supported embedding model, e.g.
```bash
vllm serve BAAI/bge-base-en-v1.5
```
1. Call it with litellm:
```python
from litellm import embedding
import os
os.environ["HOSTED_VLLM_API_BASE"] = "http://{your-vllm-server-host}:{your-vllm-server-port}/v1"
# the "hosted_vllm/" prefix is required; pass the vLLM model name after it
response = embedding(model="hosted_vllm/BAAI/bge-base-en-v1.5", input=["Hello world"])
print(response)
```
For details, see the tutorial [Using vLLM in LiteLLM](https://docs.litellm.ai/docs/providers/vllm).
---
# Lobe Chat
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
Supports speech-synthesis, multi-modal, and extensible (function call) plugin system.
One-click FREE deployment of your private OpenAI ChatGPT/Claude/Gemini/Groq/Ollama chat application.
It supports vLLM as an AI model provider to efficiently serve large language models.
For details, see the tutorial [Using vLLM in LobeChat](https://lobehub.com/docs/usage/providers/vllm).
---
# LWS
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.
## Prerequisites
* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/).
## Deploy and Serve
Deploy the following YAML file as `lws.yaml`:
??? code "Yaml"
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: vllm
spec:
replicas: 1
leaderWorkerTemplate:
size: 2
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
containers:
- name: vllm-leader
image: docker.io/vllm/vllm-openai:latest
env:
- name: HF_TOKEN
value:
command:
- sh
- -c
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline_parallel_size 2"
resources:
limits:
nvidia.com/gpu: "8"
memory: 1124Gi
ephemeral-storage: 800Gi
requests:
ephemeral-storage: 800Gi
cpu: 125
ports:
- containerPort: 8080
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- mountPath: /dev/shm
name: dshm
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 15Gi
workerTemplate:
spec:
containers:
- name: vllm-worker
image: docker.io/vllm/vllm-openai:latest
command:
- sh
- -c
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
resources:
limits:
nvidia.com/gpu: "8"
memory: 1124Gi
ephemeral-storage: 800Gi
requests:
ephemeral-storage: 800Gi
cpu: 125
env:
- name: HF_TOKEN
value:
volumeMounts:
- mountPath: /dev/shm
name: dshm
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
name: vllm-leader
spec:
ports:
- name: http
port: 8080
protocol: TCP
targetPort: 8080
selector:
leaderworkerset.sigs.k8s.io/name: vllm
role: leader
type: ClusterIP
```
```bash
kubectl apply -f lws.yaml
```
Verify the status of the pods:
```bash
kubectl get pods
```
You should get output similar to this:
```bash
NAME READY STATUS RESTARTS AGE
vllm-0 1/1 Running 0 2s
vllm-0-1 1/1 Running 0 2s
```
Verify that the distributed tensor-parallel inference works:
```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```
You should see something similar to this:
```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
## Access ClusterIP service
```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```
The output should be similar to the following:
```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
## Serve the model
Open another terminal and send a request:
```bash
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
The output should be similar to the following:
??? console "Output"
```text
{
"id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
"object": "text_completion",
"created": 1715138766,
"model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"choices": [
{
"index": 0,
"text": " top destination for foodies, with",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 12,
"completion_tokens": 7
}
}
```
---
# Modal
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
For details on how to deploy vLLM on Modal, see [this tutorial in the Modal documentation](https://modal.com/docs/examples/vllm_inference).
---
# Open WebUI
[Open WebUI](https://github.com/open-webui/open-webui) is an extensible, feature-rich,
and user-friendly self-hosted AI platform designed to operate entirely offline.
It supports various LLM runners like Ollama and OpenAI-compatible APIs,
with built-in RAG capabilities, making it a powerful AI deployment solution.
To get started with Open WebUI using vLLM, follow these steps:
1. Install [Docker](https://docs.docker.com/engine/install/).
2. Start the vLLM server with a supported chat completion model:
```console
vllm serve Qwen/Qwen3-0.6B-Chat
```
!!! note
When starting the vLLM server, be sure to specify the host and port using the `--host` and `--port` flags.
For example:
```console
vllm serve --host 0.0.0.0 --port 8000
```
3. Start the Open WebUI Docker container:
```console
docker run -d \
--name open-webui \
-p 3000:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL=http://0.0.0.0:8000/v1 \
--restart always \
ghcr.io/open-webui/open-webui:main
```
4. Open it in the browser:
At the top of the page, you should see the model `Qwen/Qwen3-0.6B-Chat`.

---
# Retrieval-Augmented Generation
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
Here are the integrations:
- vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus)
- vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus)
## vLLM + langchain
### Prerequisites
Set up the vLLM and langchain environment:
```bash
pip install -U vllm \
langchain_milvus langchain_openai \
langchain_community beautifulsoup4 \
langchain-text-splitters
```
### Deploy
1. Start the vLLM server with the supported embedding model, e.g.
```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_langchain.py](../../../examples/online_serving/retrieval_augmented_generation_with_langchain.py)
1. Run the script
```bash
python retrieval_augmented_generation_with_langchain.py
```
## vLLM + llamaindex
### Prerequisites
Set up the vLLM and llamaindex environment:
```bash
pip install vllm \
llama-index llama-index-readers-web \
llama-index-llms-openai-like \
llama-index-embeddings-openai-like \
llama-index-vector-stores-milvus \
```
### Deploy
1. Start the vLLM server with the supported embedding model, e.g.
```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```
1. Start the vLLM server with the supported chat completion model, e.g.
```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_llamaindex.py](../../../examples/online_serving/retrieval_augmented_generation_with_llamaindex.py)
1. Run the script:
```bash
python retrieval_augmented_generation_with_llamaindex.py
```
---
# SkyPilot
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
## Prerequisites
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.
```bash
pip install skypilot-nightly
sky check
```
## Run on a single instance
See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
??? code "Yaml"
```yaml
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
PYTHONUNBUFFERED: 1
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
```
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
```bash
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
```
Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the Llama model for text completion.
```console
(task, pid=7431) Running on public URL: https://.gradio.live
```
**Optional**: Serve the 70B model instead of the default 8B and use more GPU:
```bash
HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \
--gpus A100:8 \
--env HF_TOKEN \
--env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
```
## Scale up to multiple replicas
SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. You can do this by adding a `service` section to the YAML file.
??? code "Yaml"
```yaml
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_completion_tokens: 1
```
??? code "Yaml"
```yaml
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_completion_tokens: 1
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
PYTHONUNBUFFERED: 1
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
```
Start serving the Llama-3 8B model on multiple replicas:
```bash
HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \
--env HF_TOKEN
```
Wait until the service is ready:
```bash
watch -n10 sky serve status vllm
```
Example outputs:
```console
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
vllm 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
```
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
??? console "Commands"
```bash
ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
],
"stop_token_ids": [128009, 128001]
}'
```
To enable autoscaling, you can replace `replicas` with the following configuration in the `service` section:
```yaml
service:
replica_policy:
min_replicas: 2
max_replicas: 4
target_qps_per_replica: 2
```
This will scale the service between 2 and 4 replicas, adding replicas when the per-replica QPS exceeds 2; for example, at 6 QPS of total traffic the autoscaler targets 3 replicas (6 / 2).
??? code "Yaml"
```yaml
service:
replica_policy:
min_replicas: 2
max_replicas: 4
target_qps_per_replica: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_completion_tokens: 1
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
PYTHONUNBUFFERED: 1
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
```
To update the service with the new config:
```bash
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
```
To stop the service:
```bash
sky serve down vllm
```
### **Optional**: Connect a GUI to the endpoint
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
??? code "Yaml"
```yaml
envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
resources:
cpus: 2
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
# Install Gradio for web UI.
pip install gradio openai
run: |
conda activate vllm
export PATH=$PATH:/sbin
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
1. Start the chat web UI:
```bash
sky launch \
-c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm)
```
2. Then, we can access the GUI at the returned gradio link:
```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
---
# Streamlit
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
It can be quickly integrated with vLLM as a backend API server, enabling powerful LLM inference via API calls.
## Prerequisites
Set up the vLLM environment by installing all required packages:
```bash
pip install vllm streamlit openai
```
## Deploy
1. Start the vLLM server with a supported chat completion model, e.g.
```bash
vllm serve Qwen/Qwen1.5-0.5B-Chat
```
1. Use the script: [examples/online_serving/streamlit_openai_chatbot_webserver.py](../../../examples/online_serving/streamlit_openai_chatbot_webserver.py)
1. Start the Streamlit web UI and start chatting:
```bash
streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
streamlit run streamlit_openai_chatbot_webserver.py
# start with debug mode to view more details
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
```

---
# NVIDIA Triton
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
---
# KAITO
[KAITO](https://kaito-project.github.io/kaito/docs/) is a Kubernetes operator that supports deploying and serving LLMs with vLLM. It manages large models via container images with built-in OpenAI-compatible inference, auto-provisions GPU nodes, and offers curated model presets.
Please refer to [quick start](https://kaito-project.github.io/kaito/docs/quick-start) for more details.
---
# KServe
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
You can use vLLM with KServe's [Hugging Face serving runtime](https://kserve.github.io/website/docs/model-serving/generative-inference/overview) or via [`LLMInferenceService` that uses llm-d](https://kserve.github.io/website/docs/model-serving/generative-inference/llmisvc/llmisvc-overview).
---
# Kthena
[**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.
This guide shows how to deploy a production-grade, **multi-node vLLM** service on Kubernetes.
We’ll:
- Install the required components (Kthena + Volcano).
- Deploy a multi-node vLLM model via Kthena’s `ModelServing` CR.
- Validate the deployment.
---
## 1. Prerequisites
You need:
- A Kubernetes cluster with **GPU nodes**.
- `kubectl` access with cluster-admin or equivalent permissions.
- **Volcano** installed for gang scheduling.
- **Kthena** installed with the `ModelServing` CRD available.
- A valid **Hugging Face token** if loading models from Hugging Face Hub.
### 1.1 Install Volcano
```bash
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```
This provides the gang-scheduling and network topology features used by Kthena.
### 1.2 Install Kthena
```bash
helm install kthena oci://ghcr.io/volcano-sh/charts/kthena --version v0.1.0 --namespace kthena-system --create-namespace
```
- The `kthena-system` namespace is created.
- Kthena controllers and CRDs, including `ModelServing`, are installed and healthy.
Validate:
```bash
kubectl get crd | grep modelserving
```
You should see:
```text
modelservings.workload.serving.volcano.sh ...
```
---
## 2. The Multi-Node vLLM `ModelServing` Example
Kthena provides an example manifest to deploy a **multi-node vLLM cluster running Llama**. Conceptually this is equivalent to the vLLM production stack Helm deployment, but expressed with `ModelServing`.
A simplified version of the example (`llama-multinode`) looks like:
- `spec.replicas: 1` – one `ServingGroup` (one logical model deployment).
- `roles`:
- `entryTemplate` – defines **leader** pods that run:
- vLLM’s **multi-node cluster bootstrap script** (Ray cluster).
- vLLM **OpenAI-compatible API server**.
- `workerTemplate` – defines **worker** pods that join the leader’s Ray cluster.
Key points from the example YAML:
- **Image**: `vllm/vllm-openai:latest` (matches upstream vLLM images).
- **Command** (leader):
```yaml
command:
- sh
- -c
- >
bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2;
python3 -m vllm.entrypoints.openai.api_server
--port 8080
--model meta-llama/Llama-3.1-405B-Instruct
--tensor-parallel-size 8
--pipeline-parallel-size 2
```
- **Command** (worker):
```yaml
command:
- sh
- -c
- >
bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)
```
---
## 3. Deploying Multi-Node llama vLLM via Kthena
### 3.1 Prepare the Manifest
**Recommended**: use a Secret instead of a raw env var:
```bash
kubectl create secret generic hf-token \
-n default \
--from-literal=HUGGING_FACE_HUB_TOKEN='<your-hf-token>'
```
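The container spec in your `ModelServing` manifest can then consume the token via a standard Kubernetes `secretKeyRef` instead of a hard-coded value, for example (illustrative snippet):
```yaml
env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: HUGGING_FACE_HUB_TOKEN
```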
### 3.2 Apply the `ModelServing`
```bash
kubectl apply -f llama-multinode.yaml -n default   # apply the example ModelServing manifest (filename illustrative)
```
Pods created by the `ModelServing` follow the naming pattern `<name>-<group-index>-<role>-<pod-index>`.
The first number indicates the `ServingGroup`, the second segment (`405b` in the example) is the `Role`, and the remaining indices identify the pod within that role.
---
## 6. Accessing the vLLM OpenAI-Compatible API
Expose the entry via a Service:
```yaml
apiVersion: v1
kind: Service
metadata:
name: llama-multinode-openai
namespace: default
spec:
selector:
modelserving.volcano.sh/name: llama-multinode
modelserving.volcano.sh/entry: "true"
# optionally further narrow to leader role if you label it
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP
```
Port-forward from your local machine:
```bash
kubectl port-forward svc/llama-multinode-openai 30080:80 -n default
```
Then:
- List models:
```bash
curl -s http://localhost:30080/v1/models
```
- Send a completion request (mirroring vLLM production stack docs):
```bash
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-405B-Instruct",
"prompt": "Once upon a time,",
"max_tokens": 10
}'
```
You should see an OpenAI-style response from vLLM.
---
## 7. Clean Up
To remove the deployment and its resources:
```bash
kubectl delete modelserving llama-multinode -n default
```
If you’re done with the entire stack:
```bash
helm uninstall kthena -n kthena-system # or your Kthena release name
helm uninstall volcano -n volcano-system
```
---
# KubeAI
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
- [EKS](https://www.kubeai.org/installation/eks/)
- [GKE](https://www.kubeai.org/installation/gke/)
Once you have KubeAI installed, you can
[configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/)
using vLLM.
---
# KubeRay
[KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clusters.
A Ray cluster can be declared in YAML, and the operator then handles pod scheduling, networking configuration, restarts, and blue-green deployments — all while preserving the familiar Kubernetes experience.
## Why KubeRay instead of manual scripts?
| Feature | Manual scripts | KubeRay |
|---------|-----------------------------------------------------------|---------|
| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
| Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size |
| Upgrades | Tear down & re-create manually | Blue/green deployment updates supported |
| Declarative config | Bash flags & environment variables | Git-ops-friendly YAML CRDs (RayCluster/RayService) |
Using KubeRay reduces the operational burden and simplifies integration of Ray + vLLM with existing Kubernetes workflows (CI/CD, secrets, storage classes, etc.).
## Learn more
* ["Serve a Large Language Model using Ray Serve LLM on Kubernetes"](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayserve-llm-example.html) - An end-to-end example of how to serve a model using vLLM, KubeRay, and Ray Serve.
* [KubeRay documentation](https://docs.ray.io/en/latest/cluster/kubernetes/index.html)
---
# Llama Stack
vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
To install Llama Stack, run
```bash
pip install llama-stack -q
```
## Inference using OpenAI-Compatible API
Then start the Llama Stack server and configure it to point to your vLLM server with the following settings:
```yaml
inference:
- provider_id: vllm0
provider_type: remote::vllm
config:
url: http://127.0.0.1:8000
```
Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_vllm.html) for more details on this remote vLLM provider.
## Inference using Embedded vLLM
An [inline provider](https://github.com/llamastack/llama-stack/tree/main/llama_stack/providers/inline/inference)
is also available. This is a sample of configuration using that method:
```yaml
inference:
- provider_type: vllm
config:
model: Llama3.1-8B-Instruct
tensor_parallel_size: 4
```
---
# llm-d
vLLM can be deployed with [llm-d](https://github.com/llm-d/llm-d), a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale. It helps achieve the fastest "time to state-of-the-art (SOTA) performance" for key OSS models across most hardware accelerators and infrastructure providers.
You can use vLLM with llm-d directly by following [this guide](https://llm-d.ai/docs/guide) or via [KServe's LLMInferenceService](https://kserve.github.io/website/docs/model-serving/generative-inference/llmisvc/llmisvc-overview).
---
# llmaz
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
Please refer to the [Quick Start](https://github.com/InftyAI/llmaz?tab=readme-ov-file#quick-start) for more details.
---
# Production stack
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multimodel support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.
If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!
## Pre-requisite
Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
## Deployment using vLLM production stack
The standard vLLM production stack is installed using a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/utils/install-helm.sh) to install Helm on your GPU server.
To install the vLLM production stack, run the following commands on your desktop:
```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (Facebook opt-125M model).
### Validate Installation
Monitor the deployment status using:
```bash
sudo kubectl get pods
```
You will see the pods for the `vllm` deployment transition to the `Running` state.
```text
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
```
!!! note
It may take some time for the containers to download the Docker images and LLM weights.
### Send a Query to the Stack
Forward the `vllm-router-service` port to the host machine:
```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
And then you can send out a query to the OpenAI-compatible API to check the available models:
```bash
curl -o- http://localhost:30080/v1/models
```
??? console "Output"
```json
{
"object": "list",
"data": [
{
"id": "facebook/opt-125m",
"object": "model",
"created": 1737428424,
"owned_by": "vllm",
"root": null
}
]
}
```
To send an actual chat request, you can issue a curl request to the OpenAI-compatible `/v1/completions` endpoint:
```bash
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "Once upon a time,",
"max_tokens": 10
}'
```
??? console "Output"
```json
{
"id": "completion-id",
"object": "text_completion",
"created": 1737428424,
"model": "facebook/opt-125m",
"choices": [
{
"text": " there was a brave knight who...",
"index": 0,
"finish_reason": "length"
}
]
}
```
### Uninstall
To remove the deployment, run:
```bash
sudo helm uninstall vllm
```
---
### (Advanced) Configuring vLLM production stack
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
??? code "Yaml"
```yaml
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "opt125m"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "facebook/opt-125m"
replicaCount: 1
requestCPU: 6
requestMemory: "16Gi"
requestGPU: 1
pvcStorage: "10Gi"
```
In this YAML configuration:
* **`modelSpec`** includes:
* `name`: A nickname used to identify the model.
* `repository`: Docker repository of vLLM.
* `tag`: Docker image tag.
* `modelURL`: The LLM model that you want to use.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
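For example, serving a second model alongside the first amounts to appending another entry under `modelSpec`. The sketch below reuses the keys shown above; the second model name, `modelURL`, and resource/storage values are illustrative only, not the contents of the referenced tutorial files:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "opt125m"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "facebook/opt-125m"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      pvcStorage: "10Gi"
    # Hypothetical second model, reusing the same keys as above.
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "32Gi"
      requestGPU: 1
      pvcStorage: "50Gi"
```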
!!! note
If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).
!!! tip
vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!
---
# Using Kubernetes
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
- [Deployment with CPUs](#deployment-with-cpus)
- [Deployment with GPUs](#deployment-with-gpus)
- [Troubleshooting](#troubleshooting)
- [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated)
- [Conclusion](#conclusion)
Alternatively, you can deploy vLLM to Kubernetes using any of the following:
- [Helm](frameworks/helm.md)
- [InftyAI/llmaz](integrations/llmaz.md)
- [llm-d](integrations/llm-d.md)
- [KAITO](integrations/kaito.md)
- [KServe](integrations/kserve.md)
- [Kthena](integrations/kthena.md)
- [KubeRay](integrations/kuberay.md)
- [kubernetes-sigs/lws](frameworks/lws.md)
- [meta-llama/llama-stack](integrations/llamastack.md)
- [substratusai/kubeai](integrations/kubeai.md)
- [vllm-project/aibrix](https://github.com/vllm-project/aibrix)
- [vllm-project/production-stack](integrations/production-stack.md)
## Deployment with CPUs
!!! note
The use of CPUs here is for demonstration and testing purposes only; performance will not be on par with GPUs.
First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
??? console "Config"
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
```
The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
stringData:
token: "REPLACE_WITH_TOKEN"
```
Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
Here are two examples, one for NVIDIA GPUs and one for AMD GPUs.
NVIDIA GPU:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
```
AMD GPU:
You can refer to the `deployment.yaml` below if you are using an AMD ROCm GPU such as the MI300X.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
# PVC
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "8Gi"
hostNetwork: true
hostIPC: true
containers:
- name: mistral-7b
image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
securityContext:
seccompProfile:
type: Unconfined
runAsGroup: 44
capabilities:
add:
- SYS_PTRACE
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
amd.com/gpu: "1"
requests:
cpu: "6"
memory: 6G
amd.com/gpu: "1"
volumeMounts:
- name: cache-volume
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
```
You can get the full example with steps and sample yaml files from .
2. Create a Kubernetes Service for vLLM
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
```yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels and is useful for the prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
```
3. Deploy and Test
Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
To test the deployment, run the following `curl` command:
```bash
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
If the service is correctly deployed, you should receive a response from the vLLM model.
## Troubleshooting
### Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"
If the startup or readiness probe `failureThreshold` is too low for the time the server needs to start up, the kubelet will kill the container. A couple of indications that this has happened:
1. container log contains "KeyboardInterrupt: terminated"
2. `kubectl get events` shows message `Container $NAME failed startup probe, will be restarted`
To mitigate, increase the `failureThreshold` to allow more time for the model server to start serving. You can identify a suitable `failureThreshold` by removing the probes from the manifest and measuring how long the model server takes to report that it is ready to serve.
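For example, a startup probe along the following lines (the values are illustrative and should be tuned to your model's load time) gives the server up to ten minutes before the kubelet restarts it:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60   # 60 checks x 10s period = up to 10 minutes to become ready
  periodSeconds: 10
```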
## Conclusion
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
---
# Using Nginx
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
## Build Nginx Container
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
```bash
export vllm_root=`pwd`
```
Create a file named `Dockerfile.nginx`:
```dockerfile
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
```
Build the container:
```bash
docker build . -f Dockerfile.nginx --tag nginx-lb
```
## Create Simple Nginx Config file
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
??? console "Config"
```console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
}
server {
listen 80;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
## Build vLLM Container
```bash
cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm
```
If you are behind a proxy, you can pass the proxy settings to the `docker build` command as shown below:
```bash
cd $vllm_root
docker build \
-f docker/Dockerfile . \
--tag vllm \
--build-arg http_proxy=$http_proxy \
--build-arg https_proxy=$https_proxy
```
## Create Docker Network
```bash
docker network create vllm_nginx
```
## Launch vLLM Containers
Notes:
- If you have your HuggingFace models cached somewhere else, update `hf_cache_dir` below.
- If you don't have an existing HuggingFace cache you will want to start `vllm0` and wait for the model to complete downloading and the server to be ready. This will ensure that `vllm1` can leverage the model you just downloaded and it won't have to be downloaded again.
- The example below assumes a GPU backend is used. If you are using the CPU backend, remove `--gpus device=ID` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the `docker run` command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
??? console "Commands"
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run \
-itd \
--ipc host \
--network vllm_nginx \
--gpus device=0 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8081:8000 \
--name vllm0 vllm \
--model meta-llama/Llama-2-7b-chat-hf
docker run \
-itd \
--ipc host \
--network vllm_nginx \
--gpus device=1 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8082:8000 \
--name vllm1 vllm \
--model meta-llama/Llama-2-7b-chat-hf
```
!!! note
If you are behind a proxy, you can pass the proxy settings to the `docker run` command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
## Launch Nginx
```bash
docker run \
-itd \
-p 8000:80 \
--network vllm_nginx \
-v ./nginx_conf/:/etc/nginx/conf.d/ \
--name nginx-lb nginx-lb:latest
```
## Verify That vLLM Servers Are Ready
```bash
docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn
```
Both outputs should look like this:
```console
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
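Once both servers report ready, you can sanity-check the load balancer end to end. For example, with the setup above (Nginx listening on host port 8000 and both backends serving `Llama-2-7b-chat-hf`):
```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 7
    }'
```
Repeated requests are distributed across `vllm0` and `vllm1` according to the `least_conn` policy in the Nginx config.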
---
# Architecture Overview
This document provides an overview of the vLLM architecture.
[TOC]
## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.

### LLM Class
The LLM class provides the primary Python interface for doing offline inference,
which is interacting with a model without using a separate model inference
server.
Here is a sample of `LLM` class usage:
??? code
```python
from vllm import LLM, SamplingParams
# Define a list of input prompts
prompts = [
"Hello, my name is",
"The capital of France is",
"The largest ocean is",
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine with the OPT-125M model
llm = LLM(model="facebook/opt-125m")
# Generate outputs for the input prompts
outputs = llm.generate(prompts, sampling_params)
# Print the generated outputs
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
More API details can be found in the [Offline Inference](../api/README.md#offline-inference) section of the API docs.
The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](../../vllm/entrypoints/llm.py).
### OpenAI-Compatible API Server
The second primary interface to vLLM is via its OpenAI-compatible API server.
This server can be started using the `vllm serve` command.
```bash
vllm serve <model>
```
The code for the `vllm` CLI can be found in [vllm/entrypoints/cli/main.py](../../vllm/entrypoints/cli/main.py).
Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example:
```bash
python -m vllm.entrypoints.openai.api_server --model <model>
```
!!! warning
`python -m vllm.entrypoints.openai.api_server` is deprecated
and may become unsupported in a future release.
That code can be found in [vllm/entrypoints/openai/api_server.py](../../vllm/entrypoints/openai/api_server.py).
More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
## LLM Engine
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.

### LLMEngine
The `LLMEngine` class is the core component of the vLLM engine. It is
responsible for receiving requests from clients and generating outputs from the
model. The `LLMEngine` includes input processing, model execution (possibly
distributed across multiple hosts and/or GPUs), scheduling, and output
processing.
- **Input Processing**: Handles tokenization of input text using the specified
tokenizer.
- **Scheduling**: Chooses which requests are processed in each step.
- **Model Execution**: Manages the execution of the language model, including
distributed execution across multiple GPUs.
- **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text.
The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py](../../vllm/engine/llm_engine.py).
### AsyncLLMEngine
The `AsyncLLMEngine` class is an asynchronous wrapper for the `LLMEngine` class.
It uses `asyncio` to create a background loop that continuously processes
incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients.
The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
API server that serves as a simpler example in [vllm/entrypoints/api_server.py](../../vllm/entrypoints/api_server.py).
The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py](../../vllm/engine/async_llm_engine.py).
## Worker
A worker is a process that runs the model inference. vLLM follows the common
practice of using one process to control one accelerator device, such as GPUs.
For example, if we use tensor parallelism of size 2 and pipeline parallelism of
size 2, we will have 4 workers in total. Workers are identified by their
`rank` and `local_rank`. `rank` is used for global orchestration, while
`local_rank` is mainly used for assigning the accelerator device and accessing
local resources such as the file system and shared memory.
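As a rough illustration only (assuming tensor-parallel ranks form the innermost dimension and all workers share one node, which may not match vLLM's actual rank layout), the mapping could look like:
```python
# Illustrative only: enumerate workers for a TP=2 x PP=2 deployment on a single node.
tp_size, pp_size = 2, 2
for pp_rank in range(pp_size):
    for tp_rank in range(tp_size):
        rank = pp_rank * tp_size + tp_rank  # global id used for orchestration
        local_rank = rank                   # device index when all workers share one node
        print(f"worker rank={rank} local_rank={local_rank} "
              f"(tp_rank={tp_rank}, pp_rank={pp_rank})")
```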
## Model Runner
Every worker has one model runner object, responsible for loading and running
the model. Much of the model execution logic resides here, such as preparing
input tensors and capturing cudagraphs.
## Model
Every model runner object has one model object, which is the actual
`torch.nn.Module` instance. See [huggingface_integration](huggingface_integration.md) for how various
configurations affect the class we ultimately get.
## Class Hierarchy
The following figure shows the class hierarchy of vLLM:
There are several important design choices behind this class hierarchy:
1\. **Extensibility**: All classes in the hierarchy accept a configuration object
containing all the necessary information. The [VllmConfig](https://github.com/vllm-project/vllm/blob/d1c6799b8870e513bf4f2305cbf6cda9fc3d773b/vllm/config.py#L2036)
class is the main configuration object that is passed around. The class
hierarchy is quite deep, and every class needs to read the configuration it is
interested in. By encapsulating all configurations in one object, we can easily
pass the configuration object around and access the configuration we need.
Suppose we want to add a new feature (this is often the case given how fast the
field of LLM inference is evolving) that only touches the model runner. We will
have to add a new configuration option in the `VllmConfig` class. Since we pass
the whole config object around, we only need to add the configuration option to
the `VllmConfig` class, and the model runner can access it directly. We don't
need to change the constructor of the engine, worker, or model class to pass the
new configuration option.
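As a hypothetical sketch (these are stand-in classes, not vLLM's real `VllmConfig` or model runner), the pattern looks like:
```python
# Hypothetical sketch: a new option lives in the shared config object and is read
# directly by the one component that needs it, without threading it through the
# engine/worker constructors.
from dataclasses import dataclass, field


@dataclass
class ModelRunnerOptions:       # stand-in for a sub-config inside VllmConfig
    enable_my_new_feature: bool = False


@dataclass
class VllmConfigSketch:         # stand-in for vllm.config.VllmConfig
    model_runner_options: ModelRunnerOptions = field(default_factory=ModelRunnerOptions)


class ModelRunnerSketch:
    def __init__(self, vllm_config: VllmConfigSketch):
        # Only the model runner cares about the new option; other classes just
        # pass the whole config object along unchanged.
        self.enable_my_new_feature = (
            vllm_config.model_runner_options.enable_my_new_feature
        )
```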
2\. **Uniformity**: The model runner needs a unified interface to create and
initialize the model. vLLM supports more than 50 types of popular open-source
models. Each model has its own initialization logic. If the constructor
signature varies with models, the model runner does not know how to call the
constructor accordingly, without complicated and error-prone inspection logic.
By making the constructor of the model class uniform, the model runner can
easily create and initialize the model without knowing the specific model type.
This is also useful for composing models. Vision-language models often consist
of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.
!!! note
To support this change, all vLLM models' signatures have been updated to:
```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
```
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
??? code
```python
class MyOldModel(nn.Module):
def __init__(
self,
config,
cache_config: Optional[CacheConfig] = None,
quant_config: Optional[QuantizationConfig] = None,
lora_config: Optional[LoRAConfig] = None,
prefix: str = "",
) -> None:
...
from vllm.config import VllmConfig
class MyNewModel(MyOldModel):
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
config = vllm_config.model_config.hf_config
cache_config = vllm_config.cache_config
quant_config = vllm_config.quant_config
lora_config = vllm_config.lora_config
super().__init__(config, cache_config, quant_config, lora_config, prefix)
from packaging import version
if version.parse(__version__) >= version.parse("0.6.4"):
MyModel = MyNewModel
else:
MyModel = MyOldModel
```
This way, the model can work with both old and new versions of vLLM.
3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the
model weights, and quantization needs to quantize the model weights. There are
two possible ways to implement this feature. One way is to change the model
weights after the model is initialized. The other way is to change the model
weights during the model initialization. vLLM chooses the latter. The first
approach is not scalable to large models. Suppose we want to run a 405B model
(with roughly 810GB weights) with 16 H100 80GB GPUs. Ideally, every GPU should
only load 50GB weights. If we change the model weights after the model is
initialized, we need to load the full 810GB weights to every GPU and then shard
the weights, leading to a huge memory overhead. Instead, if we shard the weights
during the model initialization, every layer will only create a shard of the
weights it needs, leading to a much smaller memory overhead. The same idea
applies to quantization. Note that we also add an additional argument `prefix`
to the model's constructor so that the model can initialize itself differently
based on the prefix. This is useful for non-uniform quantization, where
different parts of the model are quantized differently. The `prefix` is
usually an empty string for the top-level model and a string like `"vision"`
or `"language"` for the sub-models. In general, it matches the name of the
module's state dict in the checkpoint file.
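To make the `prefix` idea concrete, here is a hypothetical sketch (stand-in classes, not vLLM's actual models) of a composed vision-language model forwarding prefixes to its sub-models:
```python
# Hypothetical sketch: sub-model prefixes mirror the checkpoint's state-dict names,
# so sharding or quantization can be configured per sub-module at initialization time.
import torch.nn as nn


class SubModelSketch(nn.Module):
    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()
        self.prefix = prefix            # e.g. "vision" or "language"
        self.proj = nn.Linear(16, 16)   # placeholder layer


class VisionLanguageModelSketch(nn.Module):
    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()
        # The top-level prefix is usually empty; sub-models get names matching
        # the checkpoint layout, e.g. "vision.*" and "language.*".
        self.vision = SubModelSketch(vllm_config=vllm_config, prefix="vision")
        self.language = SubModelSketch(vllm_config=vllm_config, prefix="language")
```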
One disadvantage of this design is that it is hard to write unit tests for
individual components in vLLM because every component needs to be initialized by
a complete config object. We solve this problem by providing a default
initialization function that creates a default config object with all fields set
to `None`. If the component we want to test only cares about a few fields in
the config object, we can create a default config object and set the fields we
care about. This way, we can test the component in isolation. Note that many
tests in vLLM are end-to-end tests that test the whole system, so this is not a
big problem.
In summary, the complete config object `VllmConfig` can be treated as an
engine-level global state that is shared among all vLLM classes.
---
# CUDA Graphs
This write-up introduces the new CUDA Graphs modes in vLLM v1 beyond previous [torch.compile integration](torch_compile.md). To summarize, we:
1. Added flexible `cudagraph_mode` configuration
2. Made full CUDA Graphs support orthogonal to compilation
3. Introduced a CUDA Graphs dispatcher as a central controller that picks the desired runtime mode and CUDA Graphs per batch automatically
In this document we will discuss the:
* [Motivation](#motivation)
* [CUDA Graphs modes](#cudagraphmodes)
* [Detailed design](#detailed-design)
* [Example usage of the different CUDA Graphs modes](#usage-guide)
!!! note
In this document, we refer to pure decode (`max_query_len = 1`) or speculative decode (`max_query_len = 1 + num_spec_tokens`) batches as **uniform decode** batches; the opposite are **non-uniform** batches (i.e., prefill or mixed prefill-decode batches).
!!! note
The following contents are mostly based on the last commit of .
## Motivation
Initial piecewise compilation was built to allow piecewise cudagraph capture, excluding cudagraph-unsupported operations (mainly attention). This allowed some speedup from cudagraphs while maintaining compatibility with all attention backends. We later added support for "full cudagraphs" by not compiling piecewise, so that we could further reduce the latency in cases where attention supported cudagraphs. However, this tight coupling between compilation and cudagraph capture led to an all-or-nothing experience with little flexibility. Many attention backends also weren’t ready for unified "full" CUDA Graphs capture (e.g., only FlashAttention 3 supports it currently) or only support CUDA Graphs for pure decode batches (e.g., Flashinfer, FlashMLA, and Mamba, etc.). That led to confusing performance/compatibility tradeoffs, inconsistent CUDA Graphs support, and increasingly complex code structure.
This led us to seek a more fine-grained CUDA Graphs solution with the following features:
* Explicitly aware of CUDA Graphs for prefill/mixed or (uniform-)decode batch and capture them separately.
* Separate CUDAGraph capture logic from compilation (as much as feasible) for feature orthogonality, which suggests:
* Capturing piecewise and full cudagraphs using the same compiled graph, and
* Full cudagraph capture without compilation.
* Dispatch between full and piecewise cudagraph at runtime depending on batch composition.
* Centralized control of CUDAGraph behavior, for reduced code complexity and more extensibility.
These features allow the most flexibility for cudagraph capture and compilation for all kinds of startup/performance tradeoffs and feature support.
## `CudagraphModes`
[CUDAGraphMode][vllm.config.compilation.CUDAGraphMode] is the single knob you tune in `CompilationConfig.cudagraph_mode`:
* `NONE` — turn CUDA Graphs off. Good for debugging.
* `PIECEWISE` — a single-mode strategy (and past default). It is the most flexible: attention or other CUDA Graphs-incompatible operations stay eager, everything else goes into CUDA Graphs. Requires piecewise compilation.
* `FULL` — a single-mode strategy that captures full CUDA Graphs only for non-uniform batches; uniform-decode batches reuse the CUDA Graph of the non-uniform batch with the same batch size, since they are compatible. This can be good for small models or workloads with small prompts.
* `FULL_DECODE_ONLY` — full CUDA Graph for uniform decode, no cudagraph for prefill/mixed etc.; suitable for decode instances in a P/D setup where prefill is not as important, this way we can save the memory needed for `PIECEWISE` CUDA Graphs.
* `FULL_AND_PIECEWISE` — (default mode) full CUDA Graph for uniform decode, piecewise CUDA Graphs for others; generally the most performant setting, especially for low latency with small models or MoEs, but also requires the most memory and takes the longest to capture.
Defaults: if you’re on v1 with piecewise compilation, we default to `FULL_AND_PIECEWISE` for better performance (for pooling models, it's still `PIECEWISE`). Otherwise, e.g. if piecewise compilation is unavailable, we default to `NONE`.
While `NONE`, `PIECEWISE`, and `FULL` are single-mode configurations, simply equivalent to the past implementations of eager execution, piecewise CUDA Graphs, and full CUDA Graphs respectively, `FULL_DECODE_ONLY` and `FULL_AND_PIECEWISE` are newly added dual-mode configurations, which require dispatching to dynamically switch between concrete runtime modes according to the runtime batch.
!!! note
Here, the single-modes `NONE`, `PIECEWISE`, and `FULL` are treated as the runtime modes for CUDA Graphs dispatching. If using a dual-mode, the dispatcher will always dispatch to one of its member modes (plus a potential `NONE` if no suitable CUDA Graph available), depending on the batch composition.
Although cascade attention itself is not cudagraph compatible, it now works with all possible cudagraph mode configurations: if a batch uses cascade attention, it is always dispatched to `PIECEWISE` mode if available (otherwise `NONE`).
!!! note
Not all CUDA Graph modes are compatible with every attention backend. We automatically "downgrade" modes to the closest supported mode. For example, if a backend only supports CUDA Graphs for pure decode/uniform batches, we convert `FULL` to `FULL_AND_PIECEWISE` if piecewise compilation is enabled, and `FULL_DECODE_ONLY` otherwise.
## Detailed Design
### Overview
The new CUDA Graphs logic is built on top of piecewise compilation and supports dual CUDA Graphs runtime mode switching. The system contains the following core components:
* [CUDAGraphWrapper][vllm.compilation.cuda_graph.CUDAGraphWrapper]: wrapper that handles CUDAGraph capture & replay on the wrapped callable
* [CudagraphDispatcher][vllm.v1.cudagraph_dispatcher.CudagraphDispatcher]: the central controller that contains the single source of truth about CUDA Graphs and handles dispatching between them.
* [CUDAGraphMode][vllm.config.compilation.CUDAGraphMode]: enum describing the supported and runtime modes (introduced above).
* [BatchDescriptor][vllm.forward_context.BatchDescriptor], serving as a unique representation of the runtime batch used for dispatching.
See the following figures for a quick comparison between the previous and current design patterns of CUDA Graphs with inductor compilation. We can see that previously the CUDA Graphs logic and compilation logic were tightly coupled into the vllm `PiecewiseBackend`, and CUDA Graphs were implicitly dispatched by `batch_size` alone. Now the CUDA Graphs logic is separated into the `CUDAGraphWrapper` class, responsible for both full and piecewise CUDA Graphs abilities, and dispatching is **explicitly** done via the **runtime mode** plus the `BatchDescriptor` as the **dispatch key** via the `CudagraphDispatcher`.
**Before:**

**After:**

### `BatchDescriptor`
[BatchDescriptor][vllm.forward_context.BatchDescriptor] is a component within `ForwardContext`, alongside the CUDA Graphs runtime modes, serving as the core structure for dispatching keys at runtime. The prototype is:
```python
class BatchDescriptor(NamedTuple):
num_tokens: int
num_reqs: int
uniform: bool = False
has_lora: bool = False
```
where `num_tokens` can be the padded token length, and `uniform` indicates whether all requests have the same query length. Many attention backends only support full cudagraphs when the batches are uniform; pure decode batches are uniform but may not have a query length of 1 (i.e. `num_tokens == num_reqs`). This occurs in the validation pass of spec-decode, where "decode" batches have a query length of `1 + num_spec_tokens`.
The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item.
!!! note
The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., with more items like `uniform_query_len` to support multiple different uniform decode length settings, or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token-length aware (for example, some multi-modal inputs).
### `CudagraphDispatcher`
The [CudagraphDispatcher][vllm.v1.cudagraph_dispatcher.CudagraphDispatcher] is responsible for maintaining two sets of valid dispatching keys, one set for the `FULL` runtime mode and one set for the `PIECEWISE` runtime mode, and for dispatching the correct runtime mode and dispatching key before executing the model's forward pass. It takes in the initial key (a rough batch_descriptor for the padded input), returns the selected runtime mode and the final batch_descriptor, and then tells the `CUDAGraphWrapper` instances about that decision through the forward context. Notice that `CudagraphDispatcher` is the only source of truth for available CUDA Graph keys, and `CUDAGraphWrapper` instances can blindly trust the forward context on what CUDA Graphs to dispatch to. This lets us simplify the wrapper code and centralize the logic in the dispatcher.
The dispatching keys are initialized through the dispatcher's `initialize_cudagraph_keys` method, which is called by the gpu_model_runner after all possible attention backends are initialized. This is where we can get much fancier in the future and “prepare” all kinds of CUDA Graphs combinations. For now, we just append available keys based on the valid combos of `decode_mode`/`mixed_mode` of `cudagraph_mode` and `cudagraph_capture_sizes` in the compilation config.
The dispatch code looks like:
```python
batch_descriptor=BatchDescriptor(num_tokens=num_input_tokens, uniform_decode=...)
runtime_mode, batch_descriptor = cudagraphdispatcher.dispatch(batch_descriptor)
# execution
with set_forward_context(
...,
cudagraph_runtime_mode=runtime_mode,
batch_descriptor=batch_descriptor,
):
output = self.model(...)
```
Inside the `dispatch()` method, the dispatcher searches for the proper CUDA Graphs runtime mode and an existing dispatching key to return. We basically search the existing keys following the priority `FULL` > `PIECEWISE` > `NONE`. If the dispatching key does not exist, we default to returning `NONE` mode for eager execution. The implementation can be found [here](https://github.com/vllm-project/vllm/blob/main/vllm/v1/cudagraph_dispatcher.py#L91).
Here is a simplified illustration of the workflow at runtime in the model executor:

### `CUDAGraphWrapper`
A [CUDAGraphWrapper][vllm.compilation.cuda_graph.CUDAGraphWrapper] instance wraps a runnable and simply mimics the runnable with appended CUDA Graphs abilities. Each wrapper instance is bound to a specific `runtime_mode`, which is restricted to `PIECEWISE` and `FULL` mode, and takes responsibility for capturing/replaying and passing through (directly calling) the runnable. At runtime, each wrapper would:
1. inspect the runtime_mode and batch_descriptor(dispatching key) from the global forward context.
2. If runtime_mode is `NONE` or runtime_mode does not match the mode of the wrapper, just call the runnable directly.
3. Otherwise, i.e., the runtime_mode matches the mode of the wrapper, the wrapper performs CUDA Graphs capture (if the key does not exist, it creates a new entry and caches it) or replay (if the key exists in the cache).
The above steps are based on the assumption that the CUDA Graphs wrapper would directly trust what’s in the forward context (controlled by the dispatcher). This lets us simplify and centralize the logic, reducing the complexity as well as the risk of mismatched state between the wrappers and the dispatcher. It also allows reusing the wrapper class for both `FULL` and `PIECEWISE` runtime modes. See the implementation [here](https://github.com/vllm-project/vllm/blob/f751e50b7a2aae3110d83ed0d88202fc91b3e78a/vllm/compilation/cuda_graph.py#L106).
#### Nested Wrapper design
The core mechanism of making a full CUDA Graphs and piecewise CUDA Graphs coexist and compatible is the nested CUDA Graphs wrapper design, building on top of piecewise compilation with only a single piecewise FX graph. We wrap a FULL mode wrapper outside the entire model for the full CUDA Graphs functionality; meanwhile, each piecewise backend is wrapped via a `PIECEWISE` mode wrapper inside the compilation.
The flow chart below should clearly describe how it works.

Therefore, for a `FULL` runtime mode, it is safe to capture/replay a full CUDA Graph since the piecewise wrapper is not activated. The situation is similar for `PIECEWISE` mode, as there are no conflicts between the `FULL` mode wrapper and `PIECEWISE` mode wrappers. For the `NONE` runtime mode, both `FULL` and `PIECEWISE` wrappers would not be activated, so we simply fall through to eager execution.
### Full CUDA Graph capturing & warm-up
The CUDA Graphs capturing happens when the runner first calls the model forward (using `_dummy_run`) with a non-`NONE` runtime mode. For full CUDA Graph capture, we explicitly capture the different cases (i.e., prefill/mixed batch or uniform_decode batch) by properly setting the attention metadata, to make sure the underlying attention backends launch the desired kernel routines. To distinguish a prefill/mixed batch from a uniform_decode batch, the most important property is the `max_query_len` in attn_metadata (true for most attention backends). We set it to the desired `uniform_query_len` for a uniform_decode batch; otherwise, we set it to `num_tokens` for a non-uniform_decode batch.
The CUDA Graphs wrapper no longer manages the warm-up logic. The warm-up process is now controlled directly by the GPU model runner, where the `NONE` runtime mode is assigned to play an eager execution for warm-up. When warming up for a full CUDA Graph, it is also important to explicitly run attention during the warmup `dummy_run` call.
## CUDA Graphs Compatibility of Attention Backends
To signal the CUDA Graphs compatibility of the attention backends, we introduce a new enum type, [AttentionCGSupport][vllm.v1.attention.backends.utils.AttentionCGSupport], which tracks the capability of an attention backend to support CUDA Graphs. Its values are sorted in order of capability, i.e., `ALWAYS` > `UNIFORM_BATCH` > `UNIFORM_SINGLE_TOKEN_DECODE` > `NEVER`.
```python
class AttentionCGSupport(enum.Enum):
""" Constants for the CUDA Graphs support of the attention backend
Here we do not consider the cascade attention, as currently
it is never CUDA Graphs supported."""
ALWAYS = 3
"""CUDA Graphs always supported; supports mixed-prefill-decode"""
UNIFORM_BATCH = 2
"""CUDA Graphs supported for batches the only contain query lengths that are
the same, this can be used for spec-decode
i.e. "decodes" are 1 + num_speculative_tokens"""
UNIFORM_SINGLE_TOKEN_DECODE = 1
"""CUDA Graphs supported for batches the only contain query_len==1 decodes"""
NEVER = 0
"""NO CUDA Graphs support"""
```
Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that case, we seek the minimum capability of all backends to determine the final capability of the model, and we might resolve an incompatible CUDA Graphs mode by downgrading it to the best-fit one. For example, we downgrade `FULL` mode to `FULL_AND_PIECEWISE` mode if the minimum capability is `UNIFORM_BATCH`, or to `PIECEWISE` mode if the minimum capability is `NEVER` for the -O3 compilation mode. For the complete fallback policy, please see the code for [this][vllm.v1.worker.gpu_model_runner.GPUModelRunner._check_and_update_cudagraph_mode].
The following table lists backends that support full CUDA Graphs at the time of writing.
| Attention Backend | cudagraph_support | Comments |
|:---|:---|:---|
| FlashAttention v2 | `UNIFORM_BATCH` | Actually `ALWAYS`, but falls back to `FULL_AND_PIECEWISE` as a workaround for performance reasons |
| FlashAttention v3 | `ALWAYS` | has unified routine for both batches, so `FULL` mode is good |
| Triton Attention | `ALWAYS` | prefer `FULL_AND_PIECEWISE` since it has different kernels for prefill/mixed and pure decode batches |
| AITER FlashAttention | `UNIFORM_BATCH`| |
| FlashInfer | `UNIFORM_SINGLE_TOKEN_DECODE` | Will be set to `UNIFORM_BATCH` when using TRTLLM attention on Blackwell |
| FlashMLA | `UNIFORM_BATCH` | |
| FlashInferMLA | `UNIFORM_BATCH` | |
| AITER MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| CUTLASS MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| Mamba attention| `UNIFORM_SINGLE_TOKEN_DECODE` | |
Unlisted backends are all declared as `NEVER`.
## Usage guide
The CLI now takes the uppercase string of `cudagraph_mode` directly in the compilation config: `--compilation-config '{"cudagraph_mode": "..."}'`, where `...` should be one of `NONE`, `PIECEWISE`, `FULL`, `FULL_DECODE_ONLY`, or `FULL_AND_PIECEWISE`. Note that all `PIECEWISE`-related modes require piecewise compilation, and all `FULL`-related modes need CUDA Graphs support from the attention backend. For example:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```
### Python examples
```python
import os
os.environ.setdefault("VLLM_LOGGING_LEVEL", "DEBUG")
import vllm
from vllm.config import CUDAGraphMode
compilation_config = {"mode": 3, "cudagraph_mode": "FULL_AND_PIECEWISE"}
model = vllm.LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
dtype="auto",
compilation_config=compilation_config,
)
sampling_params = vllm.SamplingParams(
temperature=0, # greedy decoding
max_tokens=1024,
)
outputs = model.generate(
["My name is John and"],
sampling_params=sampling_params,
)
```
### Piecewise compilation and full graph custom passes (attention fusion, sequence parallelism)
Unfortunately, some custom compile passes have to see the whole graph to be effective and hence aren't compatible with piecewise compilation. This includes `AttnFusionPass` and `SequenceParallelismPass`. As a short-term solution, we automatically disable piecewise compilation (by setting `splitting_ops=[]`) when attention fusion is enabled. We use CUDA Graph modes `FULL` or `FULL_DECODE_ONLY` (depending on backend support). However, this leads to another optimization incompatibility and confusing performance tradeoffs.
Long term, we've added the ability to partition the graph in Inductor instead of right after Dynamo. It can be enabled with `CompilationConfig.use_inductor_graph_partition=True` but is currently experimental and only available with `torch>=2.9`. This also increases compilation time as it has to compile the whole graph and cannot reuse piecewise compilation artifacts. Once vLLM supports 2.9, we plan to make this the default approach as it will also speed up piecewise cudagraph capture.
## About the Performance
See the following links for examples:
* [20059#issuecomment-3160858458](https://github.com/vllm-project/vllm/pull/20059#issuecomment-3160858458)
* [20059#issuecomment-3188735226](https://github.com/vllm-project/vllm/pull/20059#issuecomment-3188735226)
* [20059#issuecomment-3219888738](https://github.com/vllm-project/vllm/pull/20059#issuecomment-3219888738)
---
# Dual Batch Overlap
## Motivation
The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments.
## Introduction
The Dual Batch Overlap system works by splitting the batch in the model runner, creating two worker threads, and then running the model on each of these worker threads. When DBO is enabled, yield points within the `FusedMoEModularKernel` allow the two CPU worker threads (also called UBatch threads) to ping-pong between each other so that when one is running compute, the other is waiting on communication. Throughout the code, ubatch may be used as a short form of microbatch; this is an ASCII-friendly version of the short form µ-batch.
The DBO system includes modifications to `GpuModelRunner` and `ModularKernel`, and defines two utility classes: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` manages thread lifecycle and CUDA graph execution of the model. `UBatchContext` wraps `ForwardContext` to coordinate synchronization between the two UBatch threads.
Below is the overlap schedule that is currently implemented in vLLM.
```python
# Schedule notation legend:
# S = Shared expert
# A0 = MLA qkv proj,
# A1 = Core attn + out proj + MoE gate
# D = Dispatch
# C = Combine
# Comp: |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-|
# Comm: |----D₁---||--D₀--||----C₁---||-----C₀-----|
# Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv,
# C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv.
# MLP_SHARED_OVERLAP = "mlp_shared_overlap"
```
## Running with DBO
To enable the DBO system, pass the `--enable-dbo` argument to your `vllm serve` command. This must be used in conjunction with `--data-parallel-size N`, where N is greater than 1, and `--enable-expert-parallel`. Additionally, there are two configuration knobs:
* `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch
* `--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch
Currently, DBO is only supported with DeepEP, so DeepEP must be installed and the `--all2all-backend` argument must be set to `deepep_low_latency` if your workload is primarily decode requests, or `deepep_high_throughput` if your workload is primarily prefill requests.
Below is a command that will spin up a two DP rank server with expert parallelism and DBO enabled.
EX: `vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-dbo --all2all-backend deepep_low_latency`
Note that there must be at least two GPUs visible in `CUDA_VISIBLE_DEVICES`
## DBO Components
* GPUModelRunner
* UBatchWrapper
* UBatchContext
### GPU Model Runner
The batch is split into microbatches by the `GPUModelRunner` class. This is accomplished in two steps. First, coordination across all DP ranks is performed to determine whether microbatching will be applied. Microbatching must be uniform across all DP ranks. If microbatching is not feasible for any DP rank, it is disabled for all ranks. If all DP ranks are going to microbatch, the total number of tokens is padded up to the max number of tokens amongst all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching will be aborted and no ranks will microbatch. Once microbatching has been initiated by all ranks, the second step is performed. The `CommonAttentionMetadata` is sliced in half by the `GPUModelRunner` so that there is one attention metadata per-microbatch.
### UBatchWrapper
The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, UBatchContext, and CUDA graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner.
The implementation runs the model twice, once for each microbatch. Each model invocation occurs within a UBatch thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is provided with a sliced version of the attention metadata that is used to run its half of the batch.
CUDA graphs for DBO are entirely managed by the `UBatchWrapper`. Because of this, DBO only supports running with Full CUDA graphs. However, once a DBO CUDA graph has been captured, it can be replayed without any multithreading or CPU synchronization.
#### Interfaces
The `__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device.
The `forward` method exclusively takes in model arguments. It determines whether or not to run with DBO based on whether a `ubatch_slices` object is present in the `forward_context`. Otherwise, the model is run without DBO.
### UBatchContext
The `UBatchContext` class is a `ForwardContext` wrapper class that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated by using `make_ubatch_contexts`.
When one of the UBatch threads reaches a `dbo_yield` call, it pauses and starts the other thread, which will run until it reaches the same `dbo_yield` call. This "ping-pong" dynamic continues, with the threads swapping at each `dbo_yield` call, until the model's execution is complete.
The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method.
#### Interfaces
The `make_ubatch_contexts` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts`, and a CPU thread barrier. This function should be used exclusively to instantiate `UBatchContexts`. It will handle all of the event initialization.
The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to wait on an all-to-all kernel.
The `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists.
The `dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread.
---
# How to debug the vLLM-torch.compile integration
TL;DR:
- Use tlparse to acquire torch.compile logs. Include these logs in bug reports and/or support asks.
- The vLLM-torch.compile integration has multiple pieces. vLLM exposes flags to turn off each piece:
| Online Flag | Offline Flag | Result |
|----------|----------|-------------|
| --enforce-eager | enforce_eager=True | Turn off torch.compile and CUDAGraphs |
| -cc.mode=0 | mode=CompilationMode.NONE | Turn off torch.compile only |
| -cc.cudagraph_mode=NONE | compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE) | Turn off CUDAGraphs only |
| -cc.backend=eager | compilation_config=CompilationConfig(backend='eager') | Turn off TorchInductor |
## vLLM-torch.compile overview
To improve performance, vLLM leverages torch.compile and CUDAGraphs to speed things up.
torch.compile generates optimized kernels for PyTorch code, while CUDAGraphs reduces CPU launch overheads.
Most notably, vLLM-compile is NOT vanilla torch.compile; it is a custom compiler built using internal PyTorch compile APIs.

- Given a model, we do a full graph capture via TorchDynamo that is dynamic on the batch size (number of tokens)
- vLLM then optionally splits and/or specializes this graph and then uses TorchInductor to compile each graph into a compiled artifact.
This step may use vLLM custom Inductor passes to further optimize the graph.
- The compiled artifact is saved to vLLM's compile cache so that it can be loaded in the future.
- vLLM applies CUDAGraphs to reduce CPU overheads.
Things can go wrong in each of the four steps. When something does go wrong, please try to isolate the subsystem
that went wrong -- this lets you turn off the minimal number of pieces while keeping your reliability
goals, minimizes the impact on performance, and helps us (vLLM) when you open a bug report.
For more details on the design, please see the following resources:
- [Introduction to vLLM-torch.compile blogpost](https://blog.vllm.ai/2025/08/20/torch-compile.html)
- [vLLM-torch.compile integration design](./torch_compile.md)
- [vLLM Office Hours #26](https://www.youtube.com/live/xLyxc7hxCJc?si=Xulo9pe53C6ywf0V&t=561)
- [Talk at PyTorch Conference 2025](https://youtu.be/1wV1ESbGrVQ?si=s1GqymUfwiwOrDTg&t=725)
## Use tlparse
Use [tlparse](https://github.com/meta-pytorch/tlparse) to acquire torch.compile logs. These logs show all stages of the compilation process,
including the fused kernels that torch.compile produces.
If you can, we recommend sending these or pieces of these along with any bug reports --
they are very helpful.
Install tlparse:
```sh
pip install tlparse
```
Usage (offline inference)
```sh
TORCH_TRACE=~/trace_dir python my_script.py
tlparse ~/trace_dir/
```
Usage (serving)
```sh
TORCH_TRACE=~/trace_dir vllm serve
# ctrl-c out of the server
tlparse ~/trace_dir/
```
The `tlparse` command outputs some HTML files (e.g., into `./tl_out/index.html`).
Open the index file to see the logs. It'll look something like the following:

## Turn off vLLM-torch.compile integration
Pass `--enforce-eager` to turn off the vLLM-torch.compile integration and run entirely
in eager mode. This includes turning off CUDAGraphs.
```sh
# Online
vllm serve --enforce-eager
```
```py
# Offline
LLM(model, enforce_eager=True)
```
To turn off just torch.compile, pass `mode = NONE` to the compilation config.
(`-cc` is short for `--compilation_config`):
```sh
# Online
vllm serve -cc.mode=0
```
```py
# Offline
from vllm.config.compilation import CompilationConfig, CompilationMode
LLM(model, compilation_config=CompilationConfig(mode=CompilationMode.NONE))
```
To turn off just CUDAGraphs, pass `cudagraph_mode = NONE`:
```sh
# Online
vllm serve -cc.cudagraph_mode=NONE
```
```py
# Offline
from vllm.config.compilation import CompilationConfig, CUDAGraphMode
LLM(model, compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE))
```
## Debugging TorchDynamo
vLLM requires model code be capturable into a full graph via TorchDynamo (torch.compile's frontend).
TorchDynamo does not support all of Python. It will error (in fullgraph mode) if it cannot support
a feature (this is sometimes known as a graph break).
If you encounter a graph break, please [open an issue against pytorch/pytorch](https://github.com/pytorch/pytorch) so the PyTorch devs can prioritize it.
Then, try your best to rewrite the code to avoid the graph break.
For more information, see this [Dynamo guide](https://docs.pytorch.org/docs/stable/compile/programming_model.dynamo_core_concepts.html).
## Debugging Dynamic Shape full graph capture
vLLM requires that the model's forward pass be capturable into a full graph that is dynamic
on the batch size (i.e. the number of tokens). It (by default) compiles this one graph into
one artifact and uses this artifact for all batch sizes.
If your code cannot be captured with Dynamic Shapes, you may see silent incorrectness,
loud errors, or CUDA illegal memory accesses. For example, the following is not
capturable into a single graph:
```py
if data.shape[0] % 128 == 0:
    foo(...)
else:
    bar(...)
```
This problem is easy to diagnose. Use tlparse and click on `compilation_metrics`:
it will tell you symbolic constraints on the batch size. If there is any constraint
that restricts the batch sizes, then we've got a problem.

To avoid this, please either:
1. avoid branching on the number of tokens, or
2. wrap the branching logic into a custom operator; TorchDynamo does not
trace into custom operators (see the sketch below).
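If you go the custom operator route, here is a minimal sketch of the idea, assuming PyTorch 2.4+ for `torch.library.custom_op`; the operator name `mylib::dispatch_by_size` and the helpers `foo`/`bar` are hypothetical and not part of vLLM:
```python
import torch

def foo(x: torch.Tensor) -> torch.Tensor:
    return x * 2

def bar(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@torch.library.custom_op("mylib::dispatch_by_size", mutates_args=())
def dispatch_by_size(x: torch.Tensor) -> torch.Tensor:
    # The data-dependent branch now lives inside the custom op, which
    # TorchDynamo treats as opaque, so no guard on the number of tokens
    # is introduced into the captured graph.
    if x.shape[0] % 128 == 0:
        return foo(x)
    return bar(x)

@dispatch_by_size.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Both branches produce an output with the same shape/dtype as the input.
    return torch.empty_like(x)
```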
## Debugging constraint violations and dynamic shapes guards issues
Dynamic-shape guards are a specific category of Dynamo guards. They are constraints that `torch.compile`
attaches to dynamic dimensions (e.g., `seq_len`) to ensure the compiled artifact remains valid.
These guards typically appear when framework code, custom passes, or user code branches based on
dynamic shape values.
**Example:**
```python
if x > 10:
    ...  # path A
else:
    ...  # path B
```
This creates a guard `x > 10` or `x <= 10` depending on which path was traced.
**vLLM's Assumption:**
vLLM assumes that all guards added by torch.compile are safe to drop and will not
constrain the compiled graph to specific input shapes. When this assumption is violated,
it can cause issues that users need to debug.
Side effects that indicate this assumption has been violated include runtime errors
and `ConstraintViolationError`s.
A `ConstraintViolationError` is thrown if a dynamic shape gets constrained to
a single value. If you encounter a constraint violation error or suspect that a dynamic
shapes guard is being added incorrectly, you can use stricter dynamic shape modes to
help debug the issue:
```sh
# Online - using unbacked mode
vllm serve meta-llama/Llama-3.2-1B -cc.dynamic_shapes_config.type=unbacked
# Online - using backed_size_oblivious mode
vllm serve meta-llama/Llama-3.2-1B -cc.dynamic_shapes_config.type=backed_size_oblivious
```
```py
# Offline - using unbacked mode
from vllm.config.compilation import CompilationConfig, DynamicShapesConfig, DynamicShapesType
LLM(model, compilation_config=CompilationConfig(
    dynamic_shapes_config=DynamicShapesConfig(type=DynamicShapesType.UNBACKED)
))

# Offline - using backed_size_oblivious mode
LLM(model, compilation_config=CompilationConfig(
    dynamic_shapes_config=DynamicShapesConfig(type=DynamicShapesType.BACKED_SIZE_OBLIVIOUS)
))
```
These modes are stricter and reduce or eliminate the need for dynamic shape guards, which can help isolate issues:
- `unbacked`: uses unbacked symints, which do not allow guards, making it easier to identify where guards are being incorrectly added
- `backed_size_oblivious`: uses a mode that is stricter about guarding.
For more details on dynamic shapes modes, see [Dynamic shapes and vLLM guard dropping](torch_compile.md#dynamic-shapes-and-vllm-guard-dropping).
### Printing guards
To see all guards that are being added during compilation, you can use `TORCH_LOGS=+dynamic`:
```sh
TORCH_LOGS=+dynamic vllm serve meta-llama/Llama-3.2-1B
```
Look for `[guard added]` in the logs to see where guards are being added. This can help you identify which operations are
causing guards to be added incorrectly.
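For example, you can filter the (very verbose) dynamic logs down to just the guard additions by piping the output through grep:
```sh
TORCH_LOGS=+dynamic vllm serve meta-llama/Llama-3.2-1B 2>&1 | grep "\[guard added\]"
```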
## Debugging TorchInductor
TorchInductor takes a captured graph and then compiles it down to some Python code
that may call 1+ triton kernels. On rare (but unfortunate) occasions, it may
produce an incorrect triton kernel. This may manifest as silent incorrectness,
CUDA illegal memory accesses, or loud errors.
To debug if TorchInductor is at fault, you can disable it by passing `backend='eager'`
to the compilation config:
```sh
# online
vllm serve -cc.backend=eager
```
```py
# offline
LLM(compilation_config=CompilationConfig(backend='eager'))
```
If Inductor is at fault, [file a bug to PyTorch](https://github.com/pytorch/pytorch).
If you're feeling adventurous, you can debug the triton kernels in the Inductor output code
(that you can locate via using tlparse).

You can also use `TORCH_LOGS=output_code` to print the Inductor output code.
### Editable TorchInductor code
You can edit the TorchInductor code that gets run by setting `VLLM_COMPILE_CACHE_SAVE_FORMAT=unpacked`
or passing `-cc.compile_cache_save_format=unpacked`. The default is `binary`, which means it is not editable.
This is a useful technique: you can put breakpoints (e.g. `torch.distributed.breakpoint()`)
and print statements in the output code.
## Debugging vLLM-compile cache
vLLM built its own cache for torch.compile artifacts. The idea is that the artifacts
can be compiled once and then reused on subsequent runs. This
is a layer on top of [torch.compile's compiler cache](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html).
While torch.compile's compiler cache is rock-solid, vLLM's compiler cache is unfortunately
not always correct. You can disable it by setting `VLLM_DISABLE_COMPILE_CACHE=1`.
You can also manually remove this cache.
- Remove vLLM's compile cache with `rm -rf ~/.cache/vllm` (look at logs to see if the location changed)
- Remove torch.compile's built-in caches with `rm -rf /tmp/torchinductor_$(whoami)`
vLLM's cache is a mapping from cache key to a compiled artifact. vLLM computes
the cache key by combining multiple factors (e.g. config flags and model name).
If vLLM's compile cache is wrong, this usually means that a factor is missing.
Please see [this example](https://github.com/vllm-project/vllm/blob/18b39828d90413d05d770dfd2e2f48304f4ca0eb/vllm/config/model.py#L310)
of how vLLM computes part of the cache key.
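Conceptually, the cache key is built roughly like the sketch below (not vLLM's actual implementation; the function and factor names are illustrative):
```python
import hashlib

def compile_cache_key(model_name: str, dtype: str, config_flags: dict) -> str:
    # Every factor that can change the compiled code must be folded into the
    # key; if one is missing, a stale artifact may be reused incorrectly.
    factors = [model_name, dtype, sorted(config_flags.items())]
    return hashlib.sha256(repr(factors).encode()).hexdigest()[:16]

print(compile_cache_key("meta-llama/Llama-3.2-1B", "bfloat16", {"enable_chunked_prefill": True}))
```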
## Debugging CUDAGraphs
CUDAGraphs is a feature that allows one to:
- Capture a callable that launches 1+ CUDA kernels into a CUDAGraph
- Replay the CUDAGraph
The captured CUDAGraph contains all of the memory used during the capture process.
The replay of the CUDAGraph reads and writes to exactly the same regions of memory.
This leads to some restrictions:
1. In order to use CUDAGraphs on new data, you'll need to copy the data into a buffer
that the CUDAGraph is reading from
2. CUDAGraphs only capture CUDA kernels; they don't capture work done on the CPU.
vLLM uses the raw CUDAGraphs API, which is unsafe when used incorrectly.
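For intuition, here is a minimal sketch of the capture/replay pattern these restrictions imply, using plain PyTorch (not vLLM internals; requires a CUDA device):
```python
import torch

static_input = torch.zeros(8, 128, device="cuda")
static_output = torch.empty_like(static_input)

def work(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) * 2

# Warm up on a side stream before capture, as recommended by PyTorch.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output.copy_(work(static_input))
torch.cuda.current_stream().wait_stream(s)

# Capture: the graph records the kernels and the exact buffers they touch.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(work(static_input))

# Replay on new data: copy into the captured input buffer first (restriction 1).
new_data = torch.randn(8, 128, device="cuda")
static_input.copy_(new_data)
g.replay()  # static_output now holds work(new_data)
```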
To turn off just CUDAGraphs, pass `cudagraph_mode = NONE`:
```sh
# Online
vllm serve -cc.cudagraph_mode=NONE
```
```py
# Offline
from vllm.config.compilation import CompilationConfig, CUDAGraphMode
LLM(model, compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE))
```
---
# Fused MoE Modular Kernel
## Introduction
FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py).
Based on the format of the input activations, FusedMoE implementations are broadly classified into two types:
* Contiguous / Standard / Non-Batched, and
* Batched
!!! note
The terms Contiguous, Standard, and Non-Batched are used interchangeably throughout the document.
The input activation format completely depends on the All2All Dispatch being used.
* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K) along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example.
* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example.
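For concreteness, the Batched layout described in the second bullet can be pictured as follows (synthetic values, not tied to any particular kernel):
```python
import torch

num_experts, max_tokens, K = 4, 8, 16
activations = torch.zeros(num_experts, max_tokens, K)
expert_num_tokens = torch.tensor([3, 0, 8, 5])

# Only the first expert_num_tokens[i] rows of activations[i] are valid.
valid_rows_for_expert_2 = activations[2, : int(expert_num_tokens[2])]
print(valid_rows_for_expert_2.shape)  # torch.Size([8, 16])
```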
The FusedMoE operation is generally made up of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below.


!!! note
The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain the same.
## Motivation
As can be seen from the diagrams, there are a lot of operations and there can be a variety of implementations for each operation. The set of ways the operations can be put together to make a valid FusedMoE implementation quickly becomes intractable. The Modular Kernel framework addresses this issue, by grouping the operations into logical components. This broad categorization makes the combinations manageable and prevents code-duplication. This also decouples the All2All Dispatch & Combine implementations from the FusedMoE implementations and allows for their independent development and testing. Furthermore, the Modular Kernel framework introduces Abstract classes for the different components thus providing a well-defined skeleton for future implementations.
The rest of the document will focus on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straight-forward.
## ModularKernel Components
FusedMoEModularKernel splits the FusedMoE operation into three parts:
1. TopKWeightAndReduce
2. FusedMoEPrepareAndFinalize
3. FusedMoEPermuteExpertsUnpermute
### TopKWeightAndReduce
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`, but some implementations choose to do it in `FusedMoEPrepareAndFinalize`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class.
Please find the implementations of TopKWeightAndReduce [here](../../vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
The `FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs the `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction.
### FusedMoEPrepareAndFinalize
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare`, `prepare_no_receive` and `finalize` functions.
The `prepare` function is responsible for input activation quantization and the All2All Dispatch. If implemented, `prepare_no_receive` is like `prepare` except that it does not wait to receive results from the other workers. Instead, it returns a "receiver" callback that must be invoked to wait for the workers' final results. Not every `FusedMoEPrepareAndFinalize` class is required to support this method, but when it is available, it can be used to interleave work with the initial all-to-all communication, e.g. interleaving shared experts with fused experts. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).

### FusedMoEPermuteExpertsUnpermute
The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happens. The abstract class exposes a few important functions:
* apply()
* workspace_shapes()
* finalize_weight_and_reduce_impl()
#### apply()
The `apply` method is where the implementations perform
* Permute
* Matmul with weight W1
* Act + Mul
* Quantization
* Matmul with weight W2
* Unpermute
* Maybe TopK Weight Application + Reduction
#### workspace_shapes()
The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that effect, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces could then be used as intermediate buffers in the FusedMoE implementation.
#### finalize_weight_and_reduce_impl()
It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use.

### FusedMoEModularKernel
`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects.
`FusedMoEModularKernel` pseudocode/sketch,
```py
class FusedMoEModularKernel:
    def __init__(self,
                 prepare_finalize: FusedMoEPrepareAndFinalize,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts

    def forward(self, DP_A):
        Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...)
        workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...)

        # allocate workspaces
        workspace_13 = torch.empty(workspace13_shape, ...)
        workspace_2 = torch.empty(workspace2_shape, ...)

        # execute fused_experts
        fe_out = self.fused_experts.apply(Aq, A_scale, workspace_13, workspace_2, ...)

        # war_impl is an object of type TopKWeightAndReduceNoOp if the fused_experts
        # implementation performs the TopK Weight Application and Reduction itself.
        war_impl = self.fused_experts.finalize_weight_and_reduce_impl()

        output = self.prepare_finalize.finalize(fe_out, war_impl, ...)
        return output
```
## How-To
### How To Add a FusedMoEPrepareAndFinalize Type
Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example,
* PplxPrepareAndFinalize type is backed by Pplx All2All kernels,
* DeepEPHTPrepareAndFinalize type is backed by DeepEP High-Throughput All2All kernels, and
* DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels.
#### Step 1: Add an All2All manager
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](../../vllm/distributed/device_communicators/all2all.py).
#### Step 2: Add a FusedMoEPrepareAndFinalize Type
This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class.
`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.
`FusedMoEPrepareAndFinalize::has_prepare_no_receive()`: Indicates whether or not this subclass implements `prepare_no_receive`. Defaults to False.
`FusedMoEPrepareAndFinalize::prepare_no_receive()`: The prepare_no_receive method implements the Quantization and All2All Dispatch. It does not wait for the result of the dispatch operation but instead returns a thunk that can be invoked to wait for the final results. Typically the Dispatch function from the relevant All2All Manager is invoked.
`FusedMoEPrepareAndFinalize::finalize()`: Maybe perform TopK Weight Application and Reduction and All2All Combine. Typically the Combine function from the relevant All2AllManager is invoked.
`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.
`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoE::select_experts` function so that it can be respected. If there are no strict requirements, return None.
`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: This is the maximum number of tokens that would be submitted to the All2All Dispatch at once.
`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K). Here max_num_tokens = num_dispatchers() * max_num_tokens_per_rank().
We suggest picking an already existing `FusedMoEPrepareAndFinalize` implementation that matches your All2All implementation closely and using it as a reference.
### How To Add a FusedMoEPermuteExpertsUnpermute Type
FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows:
`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats. i.e. Contiguous / Batched format.
`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically
implementations that input `FusedMoEActivationFormat.Standard` support chunking and `FusedMoEActivationFormat.BatchedExperts` do not.
`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert map.
`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` /
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` /
`FusedMoEPermuteExpertsUnpermute::apply`: Refer to `FusedMoEPermuteExpertsUnpermute` section above.
### FusedMoEModularKernel Initialization
The `FusedMoEMethodBase` class has three methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are:
* maybe_make_prepare_finalize,
* select_gemm_impl, and
* init_prepare_finalize
#### maybe_make_prepare_finalize
The `maybe_make_prepare_finalize` method is responsible for constructing an instance of `FusedMoEPrepareAndFinalize` when appropriate based on the current all2all backend, e.g. when EP + DP is enabled. The base class method currently constructs all the `FusedMoEPrepareAndFinalize` objects for the EP+DP case. Derived classes can override this method to construct prepare/finalize objects for different scenarios, e.g. `ModelOptNvFp4FusedMoE` can construct a `FlashInferCutlassMoEPrepareAndFinalize` for the EP+TP case.
Please refer to the implementation in `ModelOptNvFp4FusedMoE`.
#### select_gemm_impl
The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
Please refer to the implementations in the following derived classes:
* `UnquantizedFusedMoEMethod`
* `CompressedTensorsW8A8Fp8MoEMethod`
* `CompressedTensorsW8A8Fp8MoECutlassMethod`
* `Fp8MoEMethod`
* `ModelOptNvFp4FusedMoE`
#### init_prepare_finalize
Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object.
Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).
**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to what FusedMoE implementation is used.
### How To Unit Test
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py).
The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and, if they are
compatible, runs some correctness tests.
If you are adding new `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations:
1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](../../tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`,
`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](../../tests/kernels/moe/modular_kernel_tools/common.py)
Doing this will add the new implementation to the test suite.
### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
The unit test file [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked
with incompatible types, the script will error.
### How To Profile
Please take a look at [profile_modular_kernel.py](../../tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py).
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
## FusedMoEPrepareAndFinalize Implementations
See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-modular-all2all-backends) for a list of all the available modular prepare and finalize subclasses.
## FusedMoEPermuteExpertsUnpermute
See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-experts-kernels) for a list of all the available modular experts.
---
# Integration with Hugging Face
This document describes how vLLM integrates with Hugging Face libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
Let's say we want to serve the popular Qwen model by running `vllm serve Qwen/Qwen2-7B`.
1. The `model` argument is `Qwen/Qwen2-7B`. vLLM determines whether this model exists by checking for the corresponding config file `config.json`. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L162-L182) for the implementation. Within this process:
- If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path.
- If the `model` argument is a Hugging Face model ID consisting of a username and model name, vLLM will first try to use the config file from the Hugging Face local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See [their website](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information on how the Hugging Face cache works.
- If the `model` argument is a Hugging Face model ID but it is not found in the cache, vLLM will download the config file from the Hugging Face model hub. Refer to [this function](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L91) for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the [config.json](https://huggingface.co/Qwen/Qwen2-7B/blob/main/config.json) file.
2. After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L185-L186) for the implementation.
3. Next, vLLM [inspects](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L189) the `model_type` field in the config dictionary to [generate](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L190-L216) the config object to use. There are some `model_type` values that vLLM directly supports; see [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L48) for the list. If the `model_type` is not in the list, vLLM will use [AutoConfig.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoConfig.from_pretrained) to load the config class, with `model`, `--revision`, and `--trust_remote_code` as the arguments. Please note that:
- Hugging Face also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models) for the list of supported models. If the `model_type` is not found, Hugging Face will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/main/config.json) for an example.
- The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, Hugging Face will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled.
4. Subsequently, vLLM applies some historical patches to the config object. These are mostly related to RoPE configuration; see [here](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/config.py#L244) for the implementation.
5. Finally, vLLM can reach the model class we want to initialize. vLLM uses the `architectures` field in the config object to determine the model class to initialize, as it maintains the mapping from architecture name to model class in [its registry](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/model_executor/models/registry.py#L80). If the architecture name is not found in the registry, it means this model architecture is not supported by vLLM. For `Qwen/Qwen2-7B`, the `architectures` field is `["Qwen2ForCausalLM"]`, which corresponds to the `Qwen2ForCausalLM` class in [vLLM's code](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/model_executor/models/qwen2.py#L364). This class will initialize itself depending on various configs.
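You can reproduce the config-loading part of this flow (steps 2-3) outside vLLM with the standard transformers API:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B", revision="main")
print(config.model_type)     # "qwen2"
print(config.architectures)  # ["Qwen2ForCausalLM"]
```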
Beyond that, there are two more things vLLM depends on Hugging Face for.
1. **Tokenizer**: vLLM uses the tokenizer from Hugging Face to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check Hugging Face's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [vllm.tokenizers.hf.get_cached_tokenizer][].
2. **Model weight**: vLLM downloads the model weight from the Hugging Face model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights.
- It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385).
This completes the integration between vLLM and Hugging Face.
In summary, vLLM reads the config file `config.json`, the tokenizer, and the model weights from the Hugging Face model hub or a local directory. It uses the config class from vLLM or Hugging Face transformers, or loads the config class from the model's repository.
---
# Hybrid KV Cache Manager
!!! warning
This document was written based on commit [458e74](https://github.com/vllm-project/vllm/commit/458e74eb907f96069e6d8a4f3c9f457001fef2ea). This feature is still in its early stage and things may change.
## What is a hybrid model?
Many recent "hybrid" LLMs combine multiple attention types within one model. For example:
1. Sliding window attention (sw) + full attention (full): gpt-oss, Gemma 2/3, Ministral, Cohere, etc.
2. Mamba + full: Bamba, Jamba, Minimax, etc.
3. Local chunked attention + full: Llama4
To serve these models efficiently, our [KVCacheManager][vllm.v1.core.kv_cache_manager.KVCacheManager] must:
1. Allocate different slots to different layer types, for example:
- Full attention layers: reserve slots for **all** tokens.
- Sliding window layers: reserve slots only for the most recent **`sliding_window_size`** tokens.
2. Support layer-specific prefix-cache rules, for example:
- Full attention: a cache hit prefix requires **all** tokens remain in the KV cache.
- Sliding window: a cache hit prefix only requires the last **`sliding_window_size`** tokens remain in the KV cache.
## Definitions
1. **kv hidden size**: The number of bytes to store one token's KV cache for a single layer.
2. **block**: the memory reserved for KV cache is divided into multiple *blocks* with the same *page size* (defined below)
3. **block size**: number of tokens inside a block
4. **page size**: the physical memory size of a block, defined as:
$$
\text{num_layers} \times \text{block_size} \times \text{kv_hidden_size}
$$
`num_layers` doesn't mean the total number of layers in the model. The exact number depends on the context in this doc.
!!! note
This is different from `KVCacheSpec.page_size_bytes` in the code, which is defined as:
$$
\text{block_size} \times \text{kv_hidden_size}
$$
## Allocation
### High level idea
We use a single memory pool for all layer types. The memory pool is split into multiple blocks with the same page size. [KVCacheManager][vllm.v1.core.kv_cache_manager.KVCacheManager] allocates different numbers of blocks to different layers according to its attention type.
The core challenge is ensuring every layer type uses the same **page size**. For full-attention-only models, the page size is straightforward, defined as:
$$
\text{page_size} = \text{block_size} \times \text{num_hidden_layers} \times \text{kv_hidden_size}
$$
However, in hybrid models, `num_hidden_layers` varies by attention type, which would normally produce mismatched page sizes. The cases below show how we unify them.
### Case 1: toy model
Let's start with a toy example: a model has 1 full attention layer and 3 sliding window attention layers. All layers have the same `kv_hidden_size`.
We let each block hold `block_size` tokens for one layer, so:
$$
\text{page_size} = \text{kv_hidden_size} \times \text{block_size}
$$
[KVCacheManager][vllm.v1.core.kv_cache_manager.KVCacheManager] allocates a different number of blocks to each layer.
This case is only a toy example. For real models, please refer to the following cases.
### Case 2: same `kv_hidden_size` and a regular pattern
When the model has more layers, e.g., 20 sliding window attention layers and 10 full attention layers with the same `kv_hidden_size`, calling the allocator once per layer (30 calls) works but is inefficient. As a solution, we group the allocation of layers that need the same number of blocks to reduce the number of calls.
The grouping is feasible because there is usually a beautiful ratio between the number of different types of layers. For example:
- Gemma-2: 1 sw : 1 full
- Llama 4: 3 local : 1 full
Our example can be regarded as 2 sw : 1 full. We can allocate blocks as if there are 2 sw and 1 full in the model, and repeat the result by 10 times to generate the `block_ids` for the 30 layers. The page size becomes:
$$
10 \times \text{kv_hidden_size} \times \text{block_size}
$$
Assume `block_size` is 16, the sliding window size is 32, and the request length is 112. Then, for the example model above, we need to allocate 11 blocks (0-6 for full, 7-8 for sw group 1, 9-10 for sw group 2).

Here, "/" denotes no block needed (sliding‑window layers don't need slots for early tokens).
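The arithmetic behind this example is simple enough to spell out (illustrative only; the real allocator handles additional details such as padding):
```python
import math

block_size = 16
sliding_window_size = 32
num_tokens = 112

# Full attention group: slots for all tokens of the request.
full_blocks = math.ceil(num_tokens / block_size)          # 7 (block ids 0-6)
# Each sliding window group: slots only for the most recent window.
sw_blocks = math.ceil(sliding_window_size / block_size)   # 2 per group

total_blocks = full_blocks + 2 * sw_blocks                # 11 in total
print(full_blocks, sw_blocks, total_blocks)               # 7 2 11
```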
See the formal definition below. The layers are divided into multiple *KV Cache Groups* so that there is:
1. **Identical attention type inside each group**: Each group only contains layers with the same attention type and thus needs the same number of blocks for a given request. This enables layers in the same group to share the same block ids without memory waste.
2. **Identical page size across groups**: Because our memory pool only has one page size.
Our example model is divided into 3 KV cache groups:
- Group 0: 10 full attention layers (full.0 - full.9)
- Group 1: 10 sliding window attention layers (sw.0 - sw.9)
- Group 2: 10 sliding window attention layers (sw.10 - sw.19)
Obviously, it satisfies rule 1. For rule 2, all 3 groups have
$$
10 \times \text{kv_hidden_size} \times \text{block_size}
$$
as their page size.
### Case 3: same `kv_hidden_size` and no regular pattern
Unfortunately, not all models have such a beautiful ratio, and the approach in Case 2 would produce too many small groups. For example, Gemma-3-27b has 52 sliding window attention layers and 10 full attention layers. With the constraints in Case 2, we would end up with 26 sliding window groups and 5 full attention groups, each containing 2 layers, and the allocation would still be inefficient. To reduce the number of KV cache groups, we group layers using the smallest layer count among all attention types, e.g., min(52, 10) = 10 layers per group in Gemma-3-27b. The grouping result is then:
- Group 0: 10 full attention layers (full.0 - full.9)
- Group 1: 10 sliding window attention layers (sw.0 - sw.9)
- Group 2: 10 sliding window attention layers (sw.10 - sw.19)
- ...
- Group 6: 10 sliding window attention layers (sw.40 - sw.49)
- Group 7: 2 sliding window attention layers (sw.50 - sw.51) and 8 padding layers
We will update this algorithm if this heuristic produces a bad result for a new model (e.g., for 20 full + 30 sw, the group size should be 10 instead of 20).
This case happens in the Gemma-3 series, and in models from Case 2 that use EAGLE speculative decoding, which introduces one additional full attention layer. The solution has some memory waste and is not perfect. Please report any cases where the padding overhead becomes unacceptable so we can refine the algorithm.
### Case 4: different `kv_hidden_size` (mainly hybrid mamba models)
Some architectures (e.g., Bamba, Jamba, Minimax) interleave standard attention layers with Mamba layers, where each Mamba layer's state size per token can be much larger than the attention layers' `kv_hidden_size`. Because we only support a single page size across all groups, we must reconcile these differing hidden sizes.
The current algorithm is:
1. Increase the `block_size` of attention layers until
$$
\text{block_size} \times \text{kv_hidden_size}_{\text{att}} \ge \text{state_size}_{\text{mamba}}
$$
2. Pad the mamba state per layer to
$$
\text{block_size} \times \text{kv_hidden_size}_{\text{att}}
$$
3. Apply the grouping strategy in case 3.
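A tiny numeric sketch of step 1, using synthetic sizes (not taken from any real model):
```python
import math

kv_hidden_size_att = 1024    # bytes per token, per attention layer (synthetic)
state_size_mamba = 400_000   # bytes per mamba state, per layer (synthetic)

# Smallest attention block_size whose per-layer piece covers one mamba state.
min_block_size = math.ceil(state_size_mamba / kv_hidden_size_att)
print(min_block_size)  # 391 here; real models can exceed 400, as noted below
```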
!!! note
This can lead to a `block_size` of more than 400 for attention layers, which is too large. Another padding strategy is to increase `block_size` until
$$
\text{block_size} \times \text{kv_hidden_size}_{\text{att}} \times \text{num_attn_layers} \ge \text{state_size}_{\text{mamba}}
$$
This padding strategy is still a work in progress.
### Case 5: KV sharing
KV sharing refers to a layer using the KV cache of another layer, e.g., gemma-3n.
In these models, [KVCacheManager][vllm.v1.core.kv_cache_manager.KVCacheManager] ignores all layers with KV sharing and only allocates KV cache for layers that need it; some patches are made in the model runner to apply the allocation result to the KV sharing layers.
## Prefix caching
For simplicity, we assume `block_size=1` in this section.
### High level idea
The block pool uses a dict similar to `tuple(block_hash, group_id) -> block` to cache the full blocks. That means the same tokens of different groups are cached and evicted independently.
When a new request comes in, we check the cache hit prefix of each group, and return the intersection of these groups as the cached prefix of the request. See below for the detailed algorithm for checking the cache hit of one group & performing the intersection.
### Case 0: full attention only models
For full attention layers, blocks are allocated for all tokens in the request. For details on the underlying design, see [Prefix Caching](prefix_caching.md).
To find the longest cache-hit prefix of a request, we enumerate from left (the first block) to right (the last block), checking whether each block is cached, and exit on the first cache miss. For example, we will return the first 7 tokens (0-6) as the cache-hit prefix in the example below (blue blocks are cached):

### Case 1: sliding window attention only models
For sliding window attention layers, a naive implementation for memory allocation is to allocate `sliding_window_size` blocks and fill in the blocks in a round-robin way. But this naive implementation is not compatible with prefix caching so we didn't pick this design. In vLLM, we allocate different blocks for different tokens and free blocks that are outside the sliding window.
For a new request, a cache-hit prefix only requires the last `sliding_window_size - 1` tokens to be cached.
Let's say `sliding_window_size = 4` and `block_size = 1`, and the request is a 15-token prompt (blue blocks are cached):

There are 3 possible cache hit prefixes:
- cache hit length 5, compute prefill with [2, 3, 4] → [5, 6, …, 14]
- cache hit length 6, compute prefill with [3, 4, 5] → [6, 7, …, 14]
- cache hit length 14, compute prefill with [11, 12, 13] → [14] (most efficient)
We can check for cache hits from right to left, and exit early when we find a match. This is the opposite of full attention, where we check from left to right and exit early when a match fails. One potential downside (compared to full attention) is that we end up iterating over the entire list of tokens when there is no match, which is a common case. This could cause non-negligible overhead, but it is fine for full + SWA models, as discussed below. A small sketch of the right-to-left scan follows.
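The sketch below assumes `block_size = 1`, matching this section; it is an illustration, not the actual vLLM implementation:
```python
def longest_sw_cache_hit(cached: list[bool], sliding_window_size: int) -> int:
    """Longest cache-hit prefix length for a sliding-window-only group."""
    need = sliding_window_size - 1  # tokens that must remain cached before the hit point
    n = len(cached)
    run = 0  # consecutive cached tokens starting at the current index
    for i in range(n - 1, -1, -1):
        run = run + 1 if cached[i] else 0
        # A prefix of length i + need is a hit if tokens [i, i + need) are cached.
        # Keep at least one token to compute, so cap the hit length at n - 1.
        if run >= need and i + need <= n - 1:
            return i + need
    return 0

# A cached set consistent with the 15-token example above
# (tokens 2-5 and 11-13 cached).
cached = [t in {2, 3, 4, 5, 11, 12, 13} for t in range(15)]
print(longest_sw_cache_hit(cached, sliding_window_size=4))  # 14
```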
### Case 2: sliding window attention + full attention models
The first problem is how to find the cache-hit prefix. We need to "intersect" the cache hits of the full attention and sliding window attention layers by:
1. Getting the longest cache hit for full attention (scanning from left to right).
2. Getting the longest cache hit for sliding window attention that is within that length. This is implemented by checking cache hits from right to left, starting from the full-attention cache-hit length.
This ensures that the resulting cache hit of the sliding window attention layers is also a cache hit of the full attention layers. It is more efficient than finding all possible prefixes of each group and intersecting them, because our approach can exit early if there is no cache hit.
The algorithm applies to models with exactly two attention types, full attention + X, where X can be an arbitrary efficient attention algorithm such as sliding window, Llama 4 local attention, or Mamba. It does not support models without full attention layers, or models with more than two types of attention. This is enough for most hybrid models at the time of writing.
The second question is the cache eviction policy. For now, we use one LRU queue for all kv cache groups. The blocks are added to the LRU queue when freed, either because the request is finished or the block is out of the sliding window.
### Case 3: mamba models
Prefix caching support for mamba models is a work in progress. Once implemented, models with mamba layers + full attention layers can be supported via the full attention + X algorithm in Case 2.
## Implementation
### Overview

The `KVCacheManager` is organized into 3 layers:
- **[KVCacheManager][vllm.v1.core.kv_cache_manager.KVCacheManager]**: The interface between the scheduler and kv cache management system.
- **[KVCacheCoordinator][vllm.v1.core.kv_cache_coordinator.KVCacheCoordinator]**: Coordinates per-group `SingleTypeKVCacheManager`s to generate the allocation result for a request. Depending on the model's configuration, one of these coordinators is chosen:
- **[KVCacheCoordinatorNoPrefixCache][vllm.v1.core.kv_cache_coordinator.KVCacheCoordinatorNoPrefixCache]**: Used when prefix caching is disabled.
- **[UnitaryKVCacheCoordinator][vllm.v1.core.kv_cache_coordinator.UnitaryKVCacheCoordinator]**: Used when there is only one KV cache group. The prefix caching logic is simplified because no intersection is needed.
- **[HybridKVCacheCoordinator][vllm.v1.core.kv_cache_coordinator.HybridKVCacheCoordinator]**: Handles exactly two KV cache groups (must include one full‑attention group plus one other efficient‑attention group). Other cases are not implemented. You can disable prefix caching to use the KVCacheCoordinatorNoPrefixCache.
- **[SingleTypeKVCacheManager][vllm.v1.core.single_type_kv_cache_manager.SingleTypeKVCacheManager]**: Each instance manages allocation and prefix caching for one KV cache group, implementing the attention‑type–specific logic (e.g., full attention, sliding window, Mamba).
The blue box in the above figure shows the case with 10 full attention layers and 20 sliding window attention layers, thus:
- use `HybridKVCacheCoordinator`
- use 1 `FullAttentionManager` and 2 `SlidingWindowManager` for the 3 `KVCacheGroup`s.
### Memory Layout
For a model with n `KVCacheGroup`s, each with m layers, we allocate m buffers. Each buffer is shared by n layers, one from each group.
The following figure is for a model with 10 full attention layers (full.0 - full.9) and 20 sliding window attention layers (sw.0-sw.19). It follows "case 2" in "Allocation" section and is divided into 3 groups:
- Group 0: 10 full attention layers (full.0 - full.9)
- Group 1: 10 sliding window attention layers (sw.0 - sw.9)
- Group 2: 10 sliding window attention layers (sw.10 - sw.19)
And for a request, we allocate 11 blocks with `block_id` 0-6 to group 0, 7-8 to group 1, and 9-10 to group 2.
With such an example, the physical memory is divided into 10 buffers (`KVCacheTensor` 0 - `KVCacheTensor` 9). Each buffer is shared by 3 layers (e.g., `KVCacheTensor` 0 is shared by full.0 from group 0, sw.0 from group 1, and sw.10 from group 2) and is divided into pieces with size `block_size * kv_hidden_size`. The KV cache of these 3 attention layers is saved to different pieces of the buffer based on the allocated `block_ids`:

!!! note
One logical "block" is mapped to 10 pieces in the 10 buffers of the physical memory.
---
# IO Processor Plugins
IO Processor plugins are a feature that allows pre- and post-processing of the model input and output for pooling models. The idea is that users can pass a custom input to vLLM that is converted into one or more model prompts and fed to the model's `encode` method. One potential use case for such plugins is using vLLM to generate multi-modal data: say, users feed an image to vLLM and get an image as output.
When performing inference with IO Processor plugins, the prompt type is defined by the plugin, and the same holds for the final request output. vLLM does not perform any validation of input/output data; it is up to the plugin to ensure the correct data is being fed to the model and returned to the user. As of now, these plugins support only pooling models and can be triggered via the `encode` method in `LLM` and `AsyncLLM`, or in online serving mode via the `/pooling` endpoint.
## Writing an IO Processor Plugin
IO Processor plugins implement the [`IOProcessor`][vllm.plugins.io_processors.interface.IOProcessor] interface:
```python
IOProcessorInput = TypeVar("IOProcessorInput")
IOProcessorOutput = TypeVar("IOProcessorOutput")


class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):
    def __init__(self, vllm_config: VllmConfig):
        self.vllm_config = vllm_config

    @abstractmethod
    def pre_process(
        self,
        prompt: IOProcessorInput,
        request_id: str | None = None,
        **kwargs,
    ) -> PromptType | Sequence[PromptType]:
        raise NotImplementedError

    async def pre_process_async(
        self,
        prompt: IOProcessorInput,
        request_id: str | None = None,
        **kwargs,
    ) -> PromptType | Sequence[PromptType]:
        return self.pre_process(prompt, request_id, **kwargs)

    @abstractmethod
    def post_process(
        self,
        model_output: Sequence[PoolingRequestOutput],
        request_id: str | None = None,
        **kwargs,
    ) -> IOProcessorOutput:
        raise NotImplementedError

    async def post_process_async(
        self,
        model_output: AsyncGenerator[tuple[int, PoolingRequestOutput]],
        request_id: str | None = None,
        **kwargs,
    ) -> IOProcessorOutput:
        # We cannot guarantee outputs are returned in the same order they were
        # fed to vLLM.
        # Let's sort them by id before post_processing
        sorted_output = sorted(
            [(i, item) async for i, item in model_output], key=lambda output: output[0]
        )
        collected_output = [output[1] for output in sorted_output]
        return self.post_process(collected_output, request_id, **kwargs)

    @abstractmethod
    def parse_request(self, request: Any) -> IOProcessorInput:
        raise NotImplementedError

    def validate_or_generate_params(
        self, params: SamplingParams | PoolingParams | None = None
    ) -> SamplingParams | PoolingParams:
        return params or PoolingParams()

    @abstractmethod
    def output_to_response(
        self, plugin_output: IOProcessorOutput
    ) -> IOProcessorResponse:
        raise NotImplementedError
```
The `parse_request` method is used for validating the user prompt and converting it into the input expected by the `pre_process`/`pre_process_async` methods.
The `pre_process*` methods take the validated plugin input to generate vLLM's model prompts for regular inference.
The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output.
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParams`/`PoolingParams` received with the user request, or for generating new ones if none are specified. The function always returns the validated/generated parameters.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/pooling/pooling/serving.py).
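As a complement to the descriptions above, here is a toy sketch of a plugin. The `IOProcessor` interface is the one shown earlier, while the class name, prompt format, and output dictionary are made up for illustration and are not a real vLLM plugin:
```python
from typing import Any

from vllm.plugins.io_processors.interface import IOProcessor


class TextListPlugin(IOProcessor[list[str], dict]):
    def parse_request(self, request: Any) -> list[str]:
        # Validate the custom request: we only accept a list of strings.
        if not (isinstance(request, list) and all(isinstance(x, str) for x in request)):
            raise ValueError("expected a list of strings")
        return request

    def pre_process(self, prompt: list[str], request_id=None, **kwargs):
        # Convert the custom input into ordinary text prompts for the model.
        return [{"prompt": text} for text in prompt]

    def post_process(self, model_output, request_id=None, **kwargs) -> dict:
        # Collapse the pooling outputs into a simple summary dictionary.
        return {"num_outputs": len(model_output)}

    def output_to_response(self, plugin_output: dict):
        # Online serving would wrap plugin_output in an IOProcessorResponse;
        # omitted here because the exact response construction is not shown above.
        raise NotImplementedError
```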
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/pooling/plugin/prithvi_geospatial_mae_client.py](../../examples/pooling/plugin/prithvi_geospatial_mae_client.py)) and offline ([examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py](../../examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py)) inference examples.
## Using an IO Processor plugin
IO Processor plugins are loaded at engine startup and there are two methods for specifying the name of the plugin to be loaded:
1. Via vLLM's `EngineArgs`: setting the `io_processor_plugin` argument in the `EngineArgs` used to initialize the `AsyncLLM`. The same can be achieved by passing the `io_processor_plugin` argument to `LLM` in offline mode, or by passing the `--io-processor-plugin` argument in serving mode.
2. Via the model HF configuration: adding an `io_processor_plugin` field to the model config (config.json).
The order above also reflects priority: setting the plugin name via `EngineArgs` overrides any plugin name specified in the model HF config (`config.json`).
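For example, a minimal offline sketch of loading a plugin (the model and plugin names are placeholders, and the exact prompt structure is defined by whichever plugin you load - see the Prithvi examples linked above):

```python
from vllm import LLM

# Placeholder names: substitute a pooling model and an IO Processor plugin
# installed in your environment.
llm = LLM(
    model="your-pooling-model",
    io_processor_plugin="your_io_processor_plugin",
)

# The prompt is plugin-defined: it is validated by `parse_request`, converted
# to model prompts by `pre_process*`, and the pooling outputs are converted
# back into the plugin's output type by `post_process*`.
plugin_prompt = ...  # plugin-specific input; see the offline example linked above
plugin_output = llm.encode(plugin_prompt)
```

In online serving mode, the equivalent is to start the server with `--io-processor-plugin your_io_processor_plugin` and send requests to the `/pooling` endpoint.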
---
# Logits Processors
!!! important
Some logits processor design changes are still in progress, and the API may
change in the near future. We hope to stabilize this part of the API soon.
This document describes how the vLLM engine interacts with logits processors, and the programming model which vLLM supports for implementing logits processors.
## Logits Processors Background
A logits processor adjusts the next-token probability distribution, usually with the intention of steering the model towards a desired type of behavior.
In vLLM, logits processors operate at batch granularity. During a given engine step, the logits processor consumes a `(num_requests) x (vocab_size)` tensor of raw logits output by the model. For all requests which enable the logits processor, the logits processor applies a transformation to the corresponding row of the logits tensor, while leaving other rows unmodified. The transformed logits tensor is then passed to softmax.
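As a toy illustration (not vLLM code) of batch-granularity processing, only the rows belonging to requests that enable the processor are transformed:

```python
import torch

num_requests, vocab_size = 4, 8
logits = torch.randn(num_requests, vocab_size)

# Suppose only requests 0 and 2 enable this (hypothetical) logits processor.
enabled_rows = torch.tensor([0, 2])
transformed = logits.clone()
transformed[enabled_rows] = transformed[enabled_rows] / 0.5  # e.g. a scaling transform
# Rows 1 and 3 reach softmax unmodified.
```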
## Logits Processors in the vLLM engine
The vLLM engine's persistent batch data structure maintains a list of loaded logits processors.
In order to operate on the entire batch at once, each logits processor may maintain metadata about the requests in the batch (i.e. each request's logits-processor-specific configuration settings). Therefore, logits processors are stateful.
In each engine step, the vLLM engine will (1) update each logits processor's internal state and (2) apply logits processors to the model output logits.
### Updating Logits Processor Internal State
At the beginning of each engine step, the persistent batch may add, discard and/or reorder requests in response to the scheduler output. After the persistent batch has reorganized, the vLLM engine invokes each logits processor's `update_state()` method. This is necessary to ensure that logits processors' internal states are reorganized to match the new persistent batch state at the beginning of the engine step.
The pseudocode below shows the process by which the vLLM persistent batch notifies each logits processor of changes in batch state:
??? code "Model Runner Updates Logits Processor States"
``` python
# gpu_model_runner.py
class GPUModelRunner(...):
...
def execute_model(self, scheduler_output, ...):
self._update_states(scheduler_output)
...
def _update_states(...):
...
# ...update persistent batch to reflect new/finished requests & reordering
# of requests within batch...
...
self.input_batch.refresh_metadata()
# gpu_input_batch.py
class InputBatch:
...
def refresh_metadata(self):
...
# Update each logits processor's state to reflect persistent batch state
batch_update = self.batch_update_builder.get_and_reset(self.num_reqs)
for logit_proc in self.logitsprocs.all:
logit_proc.update_state(batch_update)
...
# vllm/v1/sample/logits_processor/interface.py
@dataclass(frozen=True)
class BatchUpdate:
# Batch state-change data structure which is passed to logits processors'
# update_state() methods
batch_size: int
removed: Sequence[RemovedRequest]
added: Sequence[AddedRequest]
moved: Sequence[MovedRequest]
```
### Applying Logits Processors to the Model Output Logits
After updating persistent batch state, the vLLM model runner performs model inference to obtain logits. Then, the model runner invokes the sampler against the logits. In turn, part of the sampler's operation is to invoke the logits processors' `apply()` methods against the model output logits, yielding transformed logits (the `apply()` methods may modify the logits in-place or out-of-place, although in-place is more memory-efficient). This process is shown in the pseudocode below.
Note that the sampler will access the logits processors via `SamplingMetadata.logitsprocs`. When the vLLM engine constructs `SamplingMetadata` (not shown in the code below), the reference to the list of logits processors is passed from the persistent batch data structure to `SamplingMetadata`.
??? code "Apply logits processors to model output logits"
``` python
# gpu_model_runner.py
class GPUModelRunner(...):
...
def execute_model(self, scheduler_output, ...):
# (discussed in previous section)
self._update_states(scheduler_output)
...
# ...run model inference to obtain logits...
...
# Invoke sampler, which applies logits processors
sampler_output = self.sampler(logits=logits,
sampling_metadata=sampling_metadata)
...
# sampler.py
class Sampler(nn.Module):
...
def forward(self, logits, sampling_metadata):
...
# Apply non-argmax-invariant logits processors to model output logits
for processor in (sampling_metadata.logitsprocs.non_argmax_invariant):
logits = processor.apply(logits)
sampled = self.sample(logits, sampling_metadata)
...
# ...return sampler output data structure...
def sample(self, logits, sampling_metadata):
...
# ...exit early if all requests are greedy-sampling...
...
# Apply argmax-invariant logits processors
for processor in sampling_metadata.logitsprocs.argmax_invariant:
logits = processor.apply(logits)
...
# ...perform sampling and return sampling result...
```
At sampling time, the sampler checks whether all requests in the persistent batch employ greedy sampling. If that is the case, the sampler saves compute by skipping "argmax-invariant" logits processors. Here, "argmax" is shorthand for the token ID with the highest logit value in a given row of the logits tensor (i.e. the token which the model weighted the highest for a given request).
* An **argmax-invariant logits processor** is a logits processor (such as Min-P) which does not modify the argmax. For example, a logits processor which masks out the lowest-probability tokens will not change which token ID has the max logit. Greedy sampling always picks the highest-logit-value token ID, and so conceptually an argmax-invariant logits processor can be skipped for greedy sampling requests.
* A **non-argmax-invariant logits processor** is a logits processor which may modify the argmax. For example, a logits processor which masks all tokens except for EOS after a certain number of steps in order to force decoding to terminate might end up masking the max-logit-value token and therefore change the argmax. Conceptually, these logits processors cannot be skipped for greedy sampling requests.
The vLLM logits processor abstraction requires the engine to apply logits processors at batch granularity; therefore in practice the argmax-invariant logits processors can only be skipped when the entire batch uses greedy sampling.
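The distinction can be illustrated with a small (non-vLLM) example; the EOS token ID below is arbitrary:

```python
import torch

logits = torch.tensor([[4.0, 1.0, -2.0, 0.5]])  # one request, vocab size 4
eos_token_id = 3  # arbitrary ID, for illustration only

# Argmax-invariant: masking the lowest-scoring tokens keeps the same argmax
masked_low = logits.clone()
masked_low[masked_low < 0.0] = float("-inf")
assert masked_low.argmax(dim=-1).equal(logits.argmax(dim=-1))

# Non-argmax-invariant: masking everything except EOS moves the argmax
forced_eos = torch.full_like(logits, float("-inf"))
forced_eos[:, eos_token_id] = logits[:, eos_token_id]
assert not forced_eos.argmax(dim=-1).equal(logits.argmax(dim=-1))
```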
## Logits Processor Programming Model
The previous sections alluded to the interfaces which vLLM logits processors must support. This section introduces in full the programming model for implementing logits processors that are compatible with the vLLM engine, including the `LogitsProcessor` base class and its interface methods as well as the `BatchUpdate` data structure for representing persistent batch state changes, both of which are shown in the code below:
??? code "`LogitsProcessor` base class and `BatchUpdate` data structure"
``` python
from abc import ABC, abstractmethod
from collections.abc import Sequence
from dataclasses import dataclass
from enum import Enum, auto
from typing import TYPE_CHECKING
import torch
from vllm import SamplingParams
if TYPE_CHECKING:
from vllm.config import VllmConfig
class MoveDirectionality(Enum):
# One-way i1->i2 req move within batch
UNIDIRECTIONAL = auto()
# Two-way i1<->i2 req swap within batch
SWAP = auto()
# (index, params, prompt_tok_ids, output_tok_ids) tuples for new
# requests added to the batch.
AddedRequest = tuple[int, SamplingParams, list[int], list[int]]
# (index 1, index 2, directionality) tuples representing
# one-way moves or two-way swaps of requests in batch
MovedRequest = tuple[int, int, MoveDirectionality]
# Batch indices of any removed requests.
RemovedRequest = int
@dataclass(frozen=True)
class BatchUpdate:
"""Persistent batch state change info for logitsprocs"""
batch_size: int # Current num reqs in batch
# Metadata for requests added to, removed from, and moved
# within the persistent batch.
#
# Key assumption: the `output_tok_ids` list (which is an element of each
# tuple in `added`) is a reference to the request's running output tokens
# list; via this reference, the logits processors always see the latest
# list of generated output tokens
removed: Sequence[RemovedRequest]
moved: Sequence[MovedRequest]
added: Sequence[AddedRequest]
class LogitsProcessor(ABC):
@abstractmethod
def __init__(self, vllm_config: "VllmConfig", device: torch.device,
is_pin_memory: bool) -> None:
raise NotImplementedError
@abstractmethod
def apply(self, logits: torch.Tensor) -> torch.Tensor:
raise NotImplementedError
@abstractmethod
def is_argmax_invariant(self) -> bool:
"""True if logits processor has no impact on the
argmax computation in greedy sampling.
NOTE: may or may not have the same value for all
instances of a given LogitsProcessor subclass,
depending on subclass implementation.
"""
raise NotImplementedError
@abstractmethod
def update_state(
self,
batch_update: "BatchUpdate" | None,
) -> None:
"""Called when there are new output tokens, prior
to each forward pass.
Args:
batch_update is non-None iff there have been
changes to the batch makeup.
"""
raise NotImplementedError
@classmethod
def validate_params(cls, sampling_params: SamplingParams):
"""Validate sampling params for this logits processor.
Raise ValueError for invalid ones.
"""
return None
```
A vLLM logits processor must subclass `LogitsProcessor` and define (at minimum) the following methods:
* `__init__(self, vllm_config: VllmConfig, device: torch.device, is_pin_memory: bool)`
* `vllm_config`: engine configuration data structure
* `device`: hardware accelerator device info
* `is_pin_memory`: flag indicating whether pin memory is available to support logits processor implementation
* `apply(self, logits: torch.Tensor) -> torch.Tensor`:
* Consume a `(num_requests) x (vocab_size)` logits tensor (`logits`)
* Apply logits processor transformation at batch granularity
* Return a transformed `(num_requests) x (vocab_size)` logits tensor
* You can modify the input logits tensor in-place or out-of-place; in-place is more memory-efficient
* `is_argmax_invariant(self) -> bool`:
* Return `True` if the logits processor is argmax invariant (never changes what is the highest-logit-value token ID for a given request), `False` if the logits processor may modify argmax
* `is_argmax_invariant()` is evaluated once at startup; if `True`, vLLM will skip applying this logits processor in a given step when all requests use greedy sampling
* `update_state(self, batch_update: "BatchUpdate" | None) -> None`:
* Consume a `BatchUpdate` data structure representing persistent batch state changes at the beginning of the current engine step
* Use the `BatchUpdate` members to update logits processor internal state
* **Note:** the batch update data structure may be `None`, signaling no change to the batch constituents. In this case, the logits processor may still want to update its state based on the `output_token_ids` lists it retained (by reference) when requests were added.
* `validate_params(cls, sampling_params: SamplingParams)`:
* Raise `ValueError` if `SamplingParams` contains invalid arguments (especially custom arguments) used by the logits processor.
* When a request is sent to the entrypoint, `validate_params()` validates its `SamplingParams` and rejects the request if any arguments are invalid.
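Putting these pieces together, below is a minimal sketch (not one of vLLM's built-in logits processors) of a non-argmax-invariant processor that masks a per-request list of banned token IDs. The `banned_token_ids` entry in `SamplingParams.extra_args` is a hypothetical way of passing per-request configuration, and the import path is an assumption based on the file locations referenced earlier in this document:

```python
import torch

from vllm import SamplingParams
from vllm.config import VllmConfig
# Assumed import path; see vllm/v1/sample/logits_processor/ referenced above.
from vllm.v1.sample.logits_processor import (BatchUpdate, LogitsProcessor,
                                              MoveDirectionality)


class BannedTokensLogitsProcessor(LogitsProcessor):
    """Masks a hypothetical per-request set of banned token IDs."""

    def __init__(self, vllm_config: VllmConfig, device: torch.device,
                 is_pin_memory: bool) -> None:
        # Sparse state: batch index -> banned token IDs for that request
        self.banned: dict[int, list[int]] = {}

    def is_argmax_invariant(self) -> bool:
        # Masking tokens can change which token holds the max logit
        return False

    def update_state(self, batch_update: BatchUpdate | None) -> None:
        if batch_update is None:
            return  # batch makeup unchanged; nothing to do for this processor
        # Process operations in the required order: removes, adds, moves
        for index in batch_update.removed:
            self.banned.pop(index, None)
        for index, params, _prompt_tok_ids, _output_tok_ids in batch_update.added:
            # Hypothetical per-request configuration field
            banned = (getattr(params, "extra_args", None) or {}).get("banned_token_ids")
            if banned:
                self.banned[index] = list(banned)
            else:
                self.banned.pop(index, None)  # replacement request without the feature
        for src, dst, direction in batch_update.moved:
            src_state = self.banned.pop(src, None)
            dst_state = self.banned.pop(dst, None)
            if src_state is not None:
                self.banned[dst] = src_state
            if direction == MoveDirectionality.SWAP and dst_state is not None:
                self.banned[src] = dst_state

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if not self.banned:
            return logits  # short-circuit: no request enables this processor
        for index, token_ids in self.banned.items():
            logits[index, token_ids] = float("-inf")  # in-place masking
        return logits

    @classmethod
    def validate_params(cls, sampling_params: SamplingParams):
        banned = (getattr(sampling_params, "extra_args", None) or {}).get(
            "banned_token_ids")
        if banned is not None and not all(isinstance(t, int) for t in banned):
            raise ValueError("banned_token_ids must be a list of token IDs")
```

Note that the sparse dictionary only tracks requests that actually enable the processor, and `apply()` returns the input tensor untouched when no request does; both points are discussed further in the best-practices section below.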
### `BatchUpdate` data structure
The `BatchUpdate` abstraction models the persistent batch as a list of requests, supporting the following operations to change batch state (note that the order in which the operations are mentioned below reflects the order in which they should be processed in `update_state()`):
* **Remove:** remove (without replacement) request at index `i`
* A Remove is represented in `BatchUpdate.removed` by an `int` (representing `i`)
* Effect of remove-at-index on batch:
``` text
Batch: [A,B,C]
Remove @ i: 1
=>
New Batch: [A,x,C] # Discard B and leave an empty slot
```
* **Add:** add (or replace existing request with) a new request at index `i`. If a request is replaced, its associated state should be discarded.
* An Add is represented in `BatchUpdate.added` as a tuple of
``` text
(index, new request SamplingParams, prompt token ids, output token ids)
```
* `prompt token ids` and `output token ids` are references to the request's prompt token ids and output token ids lists, respectively. Note that the output token ids list grows with each engine step, and this growth is visible to the logits processor because output token ids are passed by reference. **This is important for LogitsProcessors that take into account the tokens generated so far**.
* The implementation of the particular logits processor subclass determines whether or how the fields in the added request tuple are digested into an internal representation. For example, a logits processor that does not utilize prompt or output token ids may only need to utilize `index` and `SamplingParams` and discard the other tuple fields
* If index `i` currently holds a request, a replacement occurs:
``` text
Batch: [A,B,C]
New request to be added @ i: D @ 1
=>
New Batch: [A,D,C] # Add D, discard B
```
* If index `i` does not currently hold a request (because `i` is out of bounds of the current batch size):
``` text
Batch: [A,B,C]
New request to be added @ i: D @ 3
=>
New Batch: [A,B,C,D] # Add D, extending batch
```
* **Move:** move request at index `s` to index `d` OR swap requests at indices `s` and `d`
* A Move is represented in `BatchUpdate.moved` as a tuple of
``` text
(s, d, UNIDIRECTIONAL or SWAP)
```
* If the Move specifies `UNIDIRECTIONAL`:
* The request at index `s` is moved to index `d`; index `s` becomes an empty slot
``` text
Batch: [A,x,C,D]
Unidirectionally Move s -> d: 3 -> 1
=>
New Batch: [A,D,C,x] # Move D to 1, leaving empty slot at 3
```
* If another request already resided at index `d`, it is replaced and discarded
``` text
Batch: [A,B,C,D]
Unidirectionally Move s -> d: 3 -> 1
=>
New Batch: [A,D,C,x] # Move D to 1, discarding B and leaving empty slot at 3
```
* If the Move specifies `SWAP`, the requests at `s` and `d` exchange indices
``` text
Batch: [A,B,C,D]
Swap Move s <-> d: 3 <-> 1
=>
New Batch: [A,D,C,B] # Swap B and D
```
Additionally, the `BatchUpdate` data structure includes a representation (`batch_size`) of the size of the persistent batch at the beginning of the engine step.
### How the vLLM engine builds the `BatchUpdate` data structure
Logits processor `update_state()` implementations should assume the following model for how the model runner updates persistent batch state (expressed here in terms of the `BatchUpdate` abstraction):
1. Identify indices of requests which finished in the current engine step
2. Identify new requests introduced in the current step
3. Use Add operations to replace as many finished requests as possible with new requests, in order of increasing index of the replaced request, starting with the lowest index
4. Based on the relative number of new and finished requests:
1. If the numbers of new and finished requests are the same, proceed to next step
2. *If there are more new requests than finished requests:* apply Add operations to extend the batch with the remaining new requests which did not replace finished requests. Assign consecutive indices to these new requests, starting with `current_max_batch_index + 1`
3. *If there are fewer new requests than finished requests:*
* Apply Remove operations to finished requests which were not replaced with new requests. These removed request indices will necessarily be greater than the greatest index of the finished requests which were replaced in the previous step. The Removes may leave the batch in a non-contiguous state
* **"Condense" the batch to be contiguous:** starting with the lowest-index empty slot (which was caused by a Remove), apply a Unidirectional Move from the current highest non-empty slot in the batch to fill the empty slot. Proceed with additional Unidirectional Move operations in order of increasing empty slot destination index and decreasing non-empty slot source index until the batch is contiguous
* **Shrink the batch:** a side effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
5. Reorder the batch for improved efficiency. Depending on the attention backend implementation and the current characteristics of the batch, zero or more Swap Move operations may be applied to reorder the batch
Notes:
* A logits processor `update_state()` method must process batch update operations in the following order: removes, adds, moves
* The index argument for Add operations refers to the index *at the time the Add occurred*, i.e. before any Move operations
* Example: if a request is Added at index 5 and then swapped with index 3, the Add operation in `BatchUpdate.added` will be associated with index 5 not 3
* In other words Move operations can be assumed to be applied after Adds and Removes
* Move operations can be assumed to be applied in the order in which they appear in `BatchUpdate.moved`
* If there are no new/finished requests and there is no batch reordering, then the batch update for the logits processors will be `None`
#### Example: Batch Update with Fewer New Requests Than Finished Requests
The following example models an engine step where 1 new request is introduced and 2 finished requests are eliminated; additionally, the attention backend performs a swap to optimize the batch ordering.
``` text
Batch state (beginning of engine step): [A,B,C,D]
Batch size: 4
New requests: E
Finished requests: A, C
Processing steps (using BatchUpdate abstraction):
1. Add E at index 0
[E,B,C,D] # Discard A
Batch size: 4
2. Remove at index 2
[E,B,x,D] # Discard C, empty slot at index 2
Batch size: 4
3. Condense batch with a Unidirectional Move 3 -> 2 operation and shrink batch
[E,B,D] x # Empty slot is now outside batch
Batch size: 3
4. Attention backend optimization: reorder batch with Swap 0 <-> 1
[B,E,D]
Batch size: 3
```
The resulting `BatchUpdate` data structure will look like
``` text
BatchUpdate instance
* added: [(0,E's SamplingParams,E's prompt tokens ref,E's output tokens ref)]
* removed: [2] # request C was removed without replacement
* moved: [(3,2,UNIDIRECTIONAL),(0,1,SWAP)]
```
#### Example: Batch Update with More New Requests Than Finished Requests
The following example models an engine step where 2 new requests are introduced and 1 finished request is eliminated; additionally, the attention backend performs a swap to optimize the batch ordering.
``` text
Batch state (beginning of engine step): [A,B,C,D]
Batch size: 4
New requests: E,F
Finished requests: C
Processing steps (using BatchUpdate abstraction):
1. Add E at index 2
[A,B,E,D] # Discard C
Batch size: 4
2. Add F at index 4 (current max batch index + 1)
[A,B,E,D,F] # Extend batch by 1
Batch size: 5
3. Attention backend optimization: reorder batch with Swap 0 <-> 1
[B,A,E,D,F]
Batch size: 5
```
Note that batch condensation is skipped because there are no empty slots left behind by Remove operations.
The resulting `BatchUpdate` data structure will look like
``` text
BatchUpdate instance
* added: [(2,E's SamplingParams,E's prompt tokens ref,E's output tokens ref),(4,F's SamplingParams,F's prompt tokens ref,F's output tokens ref)]
* removed: [] # no requests were removed without replacement
* moved: [(0,1,SWAP)]
```
## How to Introduce a New Logits Processor to vLLM
### Best Practices for Writing Built-In Logits Processors
* Write efficient `apply()` and `update_state()` implementations in light of the fact that logits processors operate at batch granularity
* For example, you may be able to use efficient vectorized operations to implement `apply()` or update internal state vectors in `update_state()`
* However, if you expect a logits processor to be used infrequently, it may be appropriate to use a "sparse" representation of request state, i.e. the class can represent request configuration using a dictionary which only stores metadata about requests that enable the logits processor
* It is up to the logits processor author to determine:
1. **The per-request attributes which configure the logits processor's behavior against that request.** For example, if you are writing a new built-in logits processor for vLLM, you may or may not need to add additional fields to `SamplingParams` and the vLLM REST API
2. **The conditions under which the logits processor is or is not enabled on a per-request basis.** Unless your intention is for the built-in logits processor to act on all requests all the time, you should write your logits processor in such a way that it is possible to disable the logits processor for a given request, e.g. by defaulting an argument to `None` or by using a specific do-nothing argument value such as `0.0`. Try to save compute and memory for requests which disable the logits processor
3. **The conditions under which the logits processor is short-circuited at the batch level.** Even if you have defined a way to disable the built-in logits processor at the request level, it may be difficult to translate this into compute savings, e.g. if your `update_state()` and `apply()` implementations use efficient vectorized implementations that operate on the whole persistent batch in a single command. For example, you cannot skip an entire vectorized operation in `apply()` just because one request disabled the logits processor. To save compute in the edge case where no running requests utilize the built-in logits processor, we recommend designing `apply()` to return the unmodified input tensor if all requests have the logits processor disabled. Similarly, consider whether steps can be skipped in `update_state()` if no requests enable the logits processor
* Additionally, an easy way to save compute in `update_state()` is to exit early when `batch_update` is `None`
* Ensure that the logits processor `update_state` method discards information about finished requests (i.e. requests which are replaced by an Add or which are subject to a Remove)
* `is_argmax_invariant()` can be hard-coded to `True` or `False` if the logits processor has consistent behavior. However the argmax invariance may also be determined programmatically (i.e. if your logits processor is user-customizable in some way that impacts whether the logits processor is argmax invariant). For this reason, `is_argmax_invariant()` is not a class method
### Built-In Logits Processors
Built-in logits processors are always loaded when the vLLM engine starts. See the existing vLLM built-in logits processors in `vllm/v1/sample/logits_processor/builtin.py` for examples of how to write a new built-in vLLM logits processor. It makes sense to write a PR to introduce a new logits processor as a built-in if it is likely to be useful to a wide audience. vLLM currently employs the following built-in logits processors based on the programming model described above:
* Min-P
* Logit bias
* Min-tokens
Review these logits processor implementations for guidance on writing built-in logits processors.
Additionally, the following logits-processor-like functionalities are hard-coded into the sampler and do not yet utilize the programming model described above. Most of them will be refactored to use the aforementioned logits processor programming model.
* Allowed token IDs
* Bad words
* Repetition penalty
* Frequency penalty
* Presence penalty
* Temperature
* Top-K
* Top-P
### Custom Logits Processors
vLLM can be augmented with [user-provided custom logits processors](../features/custom_logitsprocs.md).
---
# LoRA Resolver Plugins
This directory contains vLLM's LoRA resolver plugins built on the `LoRAResolver` framework.
They automatically discover and load LoRA adapters from a specified local storage path, eliminating the need for manual configuration or server restarts.
## Overview
LoRA Resolver Plugins provide a flexible way to dynamically load LoRA adapters at runtime. When vLLM
receives a request for a LoRA adapter that hasn't been loaded yet, the resolver plugins will attempt
to locate and load the adapter from their configured storage locations. This enables:
- **Dynamic LoRA Loading**: Load adapters on-demand without server restarts
- **Multiple Storage Backends**: Support for filesystem, S3, and custom backends. The built-in `lora_filesystem_resolver` requires a local storage path, but custom resolvers can be implemented to fetch from any source.
- **Automatic Discovery**: Seamless integration with existing LoRA workflows
- **Scalable Deployment**: Centralized adapter management across multiple vLLM instances
## Prerequisites
Before using LoRA Resolver Plugins, ensure the following environment variables are configured:
### Required Environment Variables
1. **`VLLM_ALLOW_RUNTIME_LORA_UPDATING`**: Must be set to `true` or `1` to enable dynamic LoRA loading
```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
```
2. **`VLLM_PLUGINS`**: Must include the desired resolver plugins (comma-separated list)
```bash
export VLLM_PLUGINS=lora_filesystem_resolver
```
3. **`VLLM_LORA_RESOLVER_CACHE_DIR`**: Must be set to a valid directory path for filesystem resolver
```bash
export VLLM_LORA_RESOLVER_CACHE_DIR=/path/to/lora/adapters
```
### Optional Environment Variables
- **`VLLM_PLUGINS`**: If not set, all available plugins will be loaded. If set to empty string, no plugins will be loaded.
## Available Resolvers
### lora_filesystem_resolver
The filesystem resolver is installed with vLLM by default and enables loading LoRA adapters from a local directory structure.
#### Setup Steps
1. **Create the LoRA adapter storage directory**:
```bash
mkdir -p /path/to/lora/adapters
```
2. **Set environment variables**:
```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
export VLLM_PLUGINS=lora_filesystem_resolver
export VLLM_LORA_RESOLVER_CACHE_DIR=/path/to/lora/adapters
```
3. **Start vLLM server**:
Your base model can be, for example, `meta-llama/Llama-2-7b-hf`. If the base model is gated, make sure your Hugging Face token is set in your environment, e.g. `export HF_TOKEN=xxx235`.
```bash
python -m vllm.entrypoints.openai.api_server \
--model your-base-model \
--enable-lora
```
#### Directory Structure Requirements
The filesystem resolver expects LoRA adapters to be organized in the following structure:
```text
/path/to/lora/adapters/
├── adapter1/
│ ├── adapter_config.json
│ ├── adapter_model.bin
│ └── tokenizer files (if applicable)
├── adapter2/
│ ├── adapter_config.json
│ ├── adapter_model.bin
│ └── tokenizer files (if applicable)
└── ...
```
Each adapter directory must contain:
- **`adapter_config.json`**: Required configuration file with the following structure:
```json
{
"peft_type": "LORA",
"base_model_name_or_path": "your-base-model-name",
"r": 16,
"lora_alpha": 32,
"target_modules": ["q_proj", "v_proj"],
"bias": "none",
"modules_to_save": null,
"use_rslora": false,
"use_dora": false
}
```
- **`adapter_model.bin`**: The LoRA adapter weights file
#### Usage Example
1. **Prepare your LoRA adapter**:
```bash
# Assuming you have a LoRA adapter in /tmp/my_lora_adapter
cp -r /tmp/my_lora_adapter /path/to/lora/adapters/my_sql_adapter
```
2. **Verify the directory structure**:
```bash
ls -la /path/to/lora/adapters/my_sql_adapter/
# Should show: adapter_config.json, adapter_model.bin, etc.
```
3. **Make a request using the adapter**:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_sql_adapter",
"prompt": "Generate a SQL query for:",
"max_tokens": 50,
"temperature": 0.1
}'
```
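Equivalently, using the OpenAI Python client (assuming the server from the setup steps above is listening on `localhost:8000`):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="my_sql_adapter",  # resolved and loaded on demand by the filesystem resolver
    prompt="Generate a SQL query for:",
    max_tokens=50,
    temperature=0.1,
)
print(completion.choices[0].text)
```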
#### How It Works
1. When vLLM receives a request for a LoRA adapter named `my_sql_adapter`
2. The filesystem resolver checks if `/path/to/lora/adapters/my_sql_adapter/` exists
3. If found, it validates the `adapter_config.json` file
4. If the configuration matches the base model and is valid, the adapter is loaded
5. The request is processed normally with the newly loaded adapter
6. The adapter remains available for future requests
## Advanced Configuration
### Multiple Resolvers
You can configure multiple resolver plugins to load adapters from different sources. Note that `lora_s3_resolver` below is an example of a custom resolver that you would need to implement yourself:
```bash
export VLLM_PLUGINS=lora_filesystem_resolver,lora_s3_resolver
```
All listed resolvers are enabled; at request time, vLLM tries them in order until one succeeds.
### Custom Resolver Implementation
To implement your own resolver plugin:
1. **Create a new resolver class**:
```python
from typing import Optional

from vllm.lora.request import LoRARequest
from vllm.lora.resolver import LoRAResolver, LoRAResolverRegistry

class CustomResolver(LoRAResolver):
    async def resolve_lora(self, base_model_name: str, lora_name: str) -> Optional[LoRARequest]:
        # Your custom resolution logic here: locate the adapter for `lora_name`
        # and return a LoRARequest pointing at it, or None if it cannot be resolved.
        pass
```
2. **Register the resolver**:
```python
def register_custom_resolver():
resolver = CustomResolver()
LoRAResolverRegistry.register_resolver("Custom Resolver", resolver)
```
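How `register_custom_resolver()` gets invoked depends on how the plugin is packaged. A minimal sketch using a setuptools entry point is shown below; the `vllm.general_plugins` entry point group name and the package/module names are assumptions, so check vLLM's plugin documentation for the exact group your version expects:

```python
# setup.py for a hypothetical "vllm-custom-lora-resolver" package.
from setuptools import setup

setup(
    name="vllm-custom-lora-resolver",
    version="0.1.0",
    py_modules=["custom_resolver"],  # module containing CustomResolver and register_custom_resolver
    entry_points={
        # Assumed entry point group; vLLM calls the referenced function at startup
        "vllm.general_plugins": [
            "custom_lora_resolver = custom_resolver:register_custom_resolver",
        ],
    },
)
```

Once the package is installed, enable the resolver with `export VLLM_PLUGINS=custom_lora_resolver` (optionally alongside `lora_filesystem_resolver`).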
## Troubleshooting
### Common Issues
1. **"VLLM_LORA_RESOLVER_CACHE_DIR must be set to a valid directory"**
- Ensure the directory exists and is accessible
- Check file permissions on the directory
2. **"LoRA adapter not found"**
- Verify the adapter directory name matches the requested model name
- Check that `adapter_config.json` exists and is valid JSON
- Ensure `adapter_model.bin` exists in the directory
3. **"Invalid adapter configuration"**
- Verify `peft_type` is set to "LORA"
- Check that `base_model_name_or_path` matches your base model
- Ensure `target_modules` is properly configured
4. **"LoRA rank exceeds maximum"**
- Check that `r` value in `adapter_config.json` doesn't exceed `max_lora_rank` setting
### Debugging Tips
1. **Enable debug logging**:
```bash
export VLLM_LOGGING_LEVEL=DEBUG
```
2. **Verify environment variables**:
```bash
echo $VLLM_ALLOW_RUNTIME_LORA_UPDATING
echo $VLLM_PLUGINS
echo $VLLM_LORA_RESOLVER_CACHE_DIR
```
3. **Test adapter configuration**:
```bash
python -c "
import json
with open('/path/to/lora/adapters/my_adapter/adapter_config.json') as f:
config = json.load(f)
print('Config valid:', config)
"
```
---
# Metrics
vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
## Objectives
- Provide comprehensive coverage of engine and request level metrics to aid production monitoring.
- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
## Background
Metrics in vLLM can be categorized as follows:
1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that server-level metrics help explain the values of request-level metrics.
### Metrics Overview
### v1 Metrics
In v1, an extensive set of metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix, for example:
- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.
These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
### Grafana Dashboard
vLLM also provides [a reference example](../../examples/online_serving/prometheus_grafana/README.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds.
- `vllm:prompt_tokens` - Prompt tokens.
- `vllm:generation_tokens` - Generation tokens.
- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
- `vllm:kv_cache_usage_perc` - Percentage of used cache blocks by vLLM.
- `vllm:request_prompt_tokens` - Request prompt length.
- `vllm:request_generation_tokens` - Request generation length.
- `vllm:request_success` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
- `vllm:request_queue_time_seconds` - Queue time.
- `vllm:request_prefill_time_seconds` - Requests prefill time.
- `vllm:request_decode_time_seconds` - Requests decode time.
- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.
See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pull/2316) for interesting and useful background on the choices made here.
### Prometheus Client Library
Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
http_requests_total{handler="/v1/completions",method="POST",status="2xx"} 201.0
http_request_size_bytes_count{handler="/v1/completions"} 201.0
http_response_size_bytes_count{handler="/v1/completions"} 201.0
http_request_duration_highr_seconds_count 201.0
http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201.0
```
### Multi-process Mode
Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See .
More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
### Built in Python/Process Metrics
The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used:
- `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total`
- `python_gc_collections_total`
- `python_info`
- `process_virtual_memory_bytes`
- `process_resident_memory_bytes`
- `process_start_time_seconds`
- `process_cpu_seconds_total`
- `process_open_fds`
- `process_max_fds`
Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.
## Metrics Design
The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
### Legacy PRs
To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:
-
-
-
-
-
### Metrics Implementation PRs
For background, here are the relevant PRs relating to the metrics implementation:
-
-
-
-
-
-
-
-
-
-
-
### Metrics Collection
In v1, we wish to move computation and overhead out of the engine core
process to minimize the time between each forward pass.
The overall idea of V1 EngineCore design is:
- EngineCore is the inner loop. Performance is most critical here
- AsyncLLM is the outer loop. This is overlapped with GPU execution
(ideally), so this is where any "overheads" should be if
possible. So AsyncLLM.output_handler_loop is the ideal place for the
metrics bookkeeping if possible.
We will achieve this by collecting metrics in the frontend API server,
and base these metrics on information we can glean from the
`EngineCoreOutputs` returned by the engine core process to the
frontend.
### Interval Calculations
Many of our metrics are the time interval between various events in
the processing of a request. It is best practice to use timestamps
based on "monotonic time" (`time.monotonic()`) rather than "wall-clock
time" (`time.time()`) to calculate intervals as the former is
unaffected by system clock changes (e.g. from NTP).
It's also important to note that monotonic clocks differ between
processes - each process has its own reference point. So it is
meaningless to compare monotonic timestamps from different processes.
Therefore, in order to calculate an interval, we must compare two
monotonic timestamps from the same process.
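As a small illustration of the rule (not actual vLLM code):

```python
import time

# Both timestamps come from the monotonic clock in the same process, so their
# difference is a valid interval even if the wall clock is adjusted (e.g. by NTP).
queued_ts = time.monotonic()
# ... the request waits in the queue, then gets scheduled ...
scheduled_ts = time.monotonic()

queue_interval = scheduled_ts - queued_ts
# Comparing `scheduled_ts` against a monotonic timestamp taken in a different
# process would be meaningless, since each process has its own reference point.
```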
### Scheduler Stats
The engine core process will collect some key statistics from the
scheduler - e.g. the number of requests that were scheduled or waiting
after the last scheduler pass - and include those statistics in
`EngineCoreOutputs`.
### Engine Core Events
The engine core will also record the timestamp of certain per-request
events so that the frontend can calculate the interval between these
events.
The events are:
- `QUEUED` - when the request was received by the engine core and
added to the scheduler queue.
- `SCHEDULED` - when the request was first scheduled for execution.
- `PREEMPTED` - the request has been put back in the waiting queue
in order to make room for other requests to complete. It will be
re-scheduled in future and re-start its prefill phase.
- `NEW_TOKENS` - when the output included in `EngineCoreOutput` was
generated. Since this is common to all requests in a given
iteration, we use a single timestamp on `EngineCoreOutputs` to
record this event.
And the calculated intervals are:
- Queue interval - between `QUEUED` and most recent `SCHEDULED`.
- Prefill interval - between most recent `SCHEDULED` and the subsequent
first `NEW_TOKENS`.
- Decode interval - between first (after the most recent `SCHEDULED`) and
last `NEW_TOKENS`.
- Inference interval - between most recent `SCHEDULED` and last `NEW_TOKENS`.
- Inter-token interval - between successive `NEW_TOKENS`.
Put another way:

We explored the possibility of having the frontend calculate these
intervals using the timing of events visible to the frontend. However,
the frontend does not have visibility into the timing of the `QUEUED`
and `SCHEDULED` events and, since we need to calculate intervals based
on monotonic timestamps from the same process ... we need the engine
core to record timestamps for all of these events.
#### Interval Calculations vs Preemptions
When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.

When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.

### Frontend Stats Collection
As the frontend processes a single `EngineCoreOutputs` - i.e. the
output from a single engine core iteration - it collects various
statistics relating to that iteration:
- The total number of new tokens generated in this iteration.
- The total number of prompt tokens processed by the prefills that
completed in this iteration.
- The queue intervals for any requests that were scheduled in this
iteration.
- The prefill intervals for any requests that completed prefill in
this iteration.
- The inter-token intervals (Time Per Output Token, TPOT), for all
requests included in this iteration.
- The Time-To-First-Token (TTFT) for any requests that completed
prefill in this iteration. However, we calculate this interval
relative to when the request was first received by the frontend
(`arrival_time`) in order to account for input processing time.
For any requests that were completed in a given iteration, we also
record:
- The inference and decode intervals - relative to the scheduled and
first token events, as described above.
- End-to-end latency - the interval between frontend `arrival_time`
and the frontend receiving the final token.
### KV Cache Residency Metrics
We also emit a set of histograms that describe how long sampled KV cache
blocks stay resident and how often they are reused. Sampling
(`--kv-cache-metrics-sample`) keeps the overhead tiny; when a block is
chosen we record:
- `lifetime` – allocation ⟶ eviction
- `idle before eviction` – last touch ⟶ eviction
- `reuse gaps` – the pauses between touches when the block gets reused
Those map directly to the Prometheus metrics:
- `vllm:kv_block_lifetime_seconds` – how long each sampled block exists.
- `vllm:kv_block_idle_before_evict_seconds` – idle tail after the final access.
- `vllm:kv_block_reuse_gap_seconds` – time between consecutive touches.
The engine core only ships raw eviction events via `SchedulerStats`; the
frontend drains them, turns them into Prometheus observations, and also
exposes the same data through `LLM.get_metrics()` when logging is on.
Looking at lifetime and idle time on one chart makes it easy to spot
stranded cache or workloads that pin prompts for a long decode.
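A short sketch of inspecting these (and other) engine metrics offline via `LLM.get_metrics()`, as mentioned above; the attributes available on each returned metric object depend on its type (counter, gauge, histogram), so this only prints what is present:

```python
from vllm import LLM, SamplingParams

# Stat logging must be enabled for metrics to be collected offline.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", disable_log_stats=False)
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

for metric in llm.get_metrics():
    name = getattr(metric, "name", type(metric).__name__)
    value = getattr(metric, "value", None)
    print(name, value)
```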
### Metrics Publishing - Logging
The `LoggingStatLogger` metrics publisher outputs a log `INFO` message
every 5 seconds with some key metrics:
- The current number of running/waiting requests
- The current GPU cache usage
- The number of prompt tokens processed per second over the past 5
seconds
- The number of new tokens generated per second over the past 5
seconds
- The prefix cache hit rate over the most recent 1k kv-cache block queries
### Metrics Publishing - Prometheus
The `PrometheusStatLogger` metrics publisher makes the metrics
available via a `/metrics` HTTP endpoint in a Prometheus-compatible
format. A Prometheus instance can then be configured to poll this
endpoint (e.g. every second) and record the values in its time-series
database. Prometheus is often used via Grafana, allowing these metrics
to be graphed over time.
Prometheus supports the following metric types:
- Counter: a value that will increase over time, never reducing, and
generally reset to zero when the vLLM instance restarts. For
example, the number of tokens generated over the lifetime of the
instance.
- Gauge: a value that goes up and down, for example the number of
requests currently scheduled for execution.
- Histogram: a count of metric samples, recorded in buckets. For
example, the number of requests whose TTFT was <1ms, <5ms, <10ms,
<20ms, and so on.
Prometheus metrics can also be labelled, allowing metrics to be
combined according to matching labels. In vLLM, we add a `model_name`
label to every metric which includes the name of the model served by
that instance.
Example output:
```bash
$ curl http://0.0.0.0:8000/metrics
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
...
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 27453.0
...
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 131.0
vllm:request_success_total{finished_reason="abort",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
...
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 13.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 97.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 123.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 138.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
```
!!! note
The choice of histogram buckets to be most useful to users
across a broad set of use cases is not straightforward and will
require refinement over time.
### Cache Config Info
`prometheus_client` has support for
[Info metrics](https://prometheus.github.io/client_python/instrumenting/info/)
which are equivalent to a `Gauge` whose value is permanently set to 1,
but exposes interesting key/value pair information via labels. This is
used for information about an instance that does not change - so it
only needs to be observed at startup - and allows comparing across
instances in Prometheus.
We use this concept for the `vllm:cache_config_info` metric:
```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
```
However, `prometheus_client` has
[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead.
### LoRA Metrics
The `vllm:lora_requests_info` `Gauge` is somewhat similar, except the
value is the current wall-clock time, and is updated every iteration.
The label names used are:
- `running_lora_adapters`: a per-adapter count of the number requests
running using that adapter, formatted as a comma-separated string.
- `waiting_lora_adapters`: similar, except counting requests that are
waiting to be scheduled.
- `max_lora`: the static "max number of LoRAs in a single batch"
configuration.
Encoding running/waiting counts for multiple adapters in a
comma-separated string seems quite misguided - we could instead use
labels to distinguish per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes.
This was added in and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
If we revisit this design and deprecate the old metric, we should
coordinate with downstream users so they can migrate before the removal.
### Prefix Cache metrics
The discussion in about adding prefix cache metrics yielded
some interesting points which may be relevant to how we approach
future metrics.
Every time the prefix cache is queried, we record the number of tokens
queried and the number of queried tokens present in the cache
(i.e. hits).
However, the metric of interest is the hit rate - i.e. the number of
hits per query.
In the case of logging, we expect the user is best served by
calculating the hit rate over a fixed number of the most recent
queries (the interval is fixed to 1k most recent queries for now).
In the case of Prometheus though, we should take advantage of the
time-series nature of Prometheus and allow the user to calculate the
hit rate over an interval of their choosing. For example, a PromQL
query to calculate the hit rate over the past 5 minutes:
```text
rate(cache_query_hit[5m]) / rate(cache_query_total[5m])
```
To achieve this, we should record the queries and hits as counters in
Prometheus, rather than recording the hit rate as a gauge.
## Deprecated Metrics
### How To Deprecate
Deprecating metrics shouldn't be taken lightly. Users may not notice that a
metric has been deprecated, and may be quite inconvenienced when it is
suddenly (from their perspective) removed, even if there is an
equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](https://github.com/vllm-project/vllm/pull/2764) (with a comment in the code),
[removed](https://github.com/vllm-project/vllm/pull/12383), and then [noticed by a user](https://github.com/vllm-project/vllm/issues/13218).
In general:
1. We should be cautious about deprecating metrics, especially since
it can be hard to predict the user impact.
2. We should include a prominent deprecation notice in the help string
that is included in the `/metrics` output.
3. We should list deprecated metrics in user-facing documentation and
release notes.
4. We should consider hiding deprecated metrics behind a CLI argument
in order to give administrators
[an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
See the [deprecation policy](../contributing/deprecation_policy.md) for
the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
Added by , but apparently never implemented. This can just be
removed.
### Duplicated - Queue Time
The `vllm:time_in_queue_requests` Histogram metric was added by
and its calculation is:
```python
self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time
```
Two weeks later, added `vllm:request_queue_time_seconds` leaving
us with:
```python
if seq_group.is_finished():
if (seq_group.metrics.first_scheduled_time is not None and
seq_group.metrics.first_token_time is not None):
time_queue_requests.append(
seq_group.metrics.first_scheduled_time -
seq_group.metrics.arrival_time)
...
if seq_group.metrics.time_in_queue is not None:
time_in_queue_requests.append(
seq_group.metrics.time_in_queue)
```
This seems duplicative, and one of them should be removed. The latter
is used by the Grafana dashboard, so we should deprecate or remove the
former.
### Prefix Cache Hit Rate
See above - we now expose 'queries' and 'hits' counters rather than a
'hit rate' gauge.
### KV Cache Offloading
Two legacy metrics relate to a "swapped" preemption mode that is no
longer relevant in v1:
- `vllm:num_requests_swapped`
- `vllm:cpu_cache_usage_perc`
In this mode, when a request is preempted (e.g. to make room in KV
cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.
Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
swapping was intended for these beam search like cases.
Later, the concept of prefix caching was introduced, which allowed KV
cache blocks to be shared implicitly. This proved to be a better
option than CPU swapping since blocks can be evicted slowly on demand
and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`).
[Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a
lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero overhead) and therefore
on by default, the preemption-and-recompute strategy should work
better.
## Future Work
### Parallel Sampling
Some legacy metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.
As part of adding parallel sampling support in , we should
also add these metrics.
- `vllm:request_params_n` (Histogram)
Observes the value of the 'n' parameter of every finished request.
- `vllm:request_max_num_generation_tokens` (Histogram)
Observes the maximum output length of all sequences in every finished
sequence group. In the absence of parallel sampling, this is
equivalent to `vllm:request_generation_tokens`.
### Speculative Decoding
Some legacy metrics are specific to "speculative decoding". This is where
we generate candidate tokens using a faster, approximate method or
model and then validate those tokens with the larger model.
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens` (Counter)
- `vllm:spec_decode_num_draft_tokens` (Counter)
- `vllm:spec_decode_num_emitted_tokens` (Counter)
There is a PR under review () to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit these metrics in this context.
!!! note
We should probably expose acceptance rate as separate accepted
and draft counters, like we do for prefix caching hit rate. Efficiency
likely also needs similar treatment.
### Autoscaling and Load-balancing
A common use case for our metrics is to support automated scaling of
vLLM instances.
For related discussion from the
[Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving),
see:
- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- and .
This is a non-trivial topic. Consider this comment from Rob:
> I think this metric should focus on trying to estimate what the max
> concurrency that will cause the average request length > queries per
> second ... since this is really what will "saturate" the server.
A clear goal is that we should expose the metrics required to detect
this saturation point, so administrators can implement auto-scaling
rules based on those. However, in order to do so, we need to have a
clear view on how an administrator (and automated monitoring system)
should judge an instance as approaching saturation:
> To identify, what is the saturation point for model server compute
> (the inflection point where we cannot get more throughput with a
> higher request rate, but start to incur additional latency) so we
> can autoscale effectively?
### Metric Naming
Our approach to naming metrics probably deserves to be revisited:
1. The use of colons in metric names seems contrary to
["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
2. Most of our metrics follow the convention of ending with units, but
not all do.
3. Some of our metric names end with `_total` (see the demonstration after this list):
> If there is a suffix of `_total` on the metric name, it will be removed. When
> exposing the time series for counter, a `_total` suffix will be added. This is
> for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
> requires the `_total` suffix.
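This behaviour can be observed directly with `prometheus_client` (the metric name below is illustrative):

```python
# A Counter declared without the _total suffix is exposed with it in the
# Prometheus text format.
from prometheus_client import Counter, generate_latest

requests_finished = Counter("vllm:demo_request_success", "Finished requests.")
requests_finished.inc()

exposition = generate_latest().decode()
assert "vllm:demo_request_success_total 1.0" in exposition
```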
### Adding More Metrics
There is no shortage of ideas for new metrics:
- Examples from other projects like
[TGI](https://github.com/IBM/text-generation-inference?tab=readme-ov-file#metrics)
- Proposals arising from specific use cases, like the Kubernetes
auto-scaling topic above
- Proposals that might arise out of standardisation efforts like
[OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
We should be cautious in our approach to adding new metrics. While
metrics are often relatively straightforward to add:
1. They can be difficult to remove - see the section on deprecation
above.
2. They can have a meaningful performance impact when enabled. And
metrics are usually of very limited use unless they can be enabled
by default and in production.
3. They have an impact on development and maintenance of the
project. Every metric added over time has made this effort more
time-consuming, and perhaps not all metrics justify this ongoing
investment in their maintenance.
## Tracing - OpenTelemetry
Metrics provide an aggregated view over time of the system's
performance and health. Tracing, on the other hand, tracks individual
requests as they move through different services and components. Both
fall under the more general heading of "Observability".
vLLM has support for OpenTelemetry tracing:
- Added by and reinstated by
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
OpenTelemetry has a
[Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
Since metrics are a big enough topic on their own, we consider tracing
to be quite separate from metrics.
### OpenTelemetry Model Forward vs Execute Time
The current implementation exposes the following two metrics:
- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
in the model forward pass when this request was in the batch.
- `vllm:model_execute_time_milliseconds` (Histogram) - The time spent
in the model execute function. This will include model forward,
block/sync across workers, cpu-gpu sync time and sampling time.
These metrics are only enabled when OpenTelemetry tracing is enabled
and if `--collect-detailed-traces=all/model/worker` is used. The
documentation for this option states:
> collect detailed traces for the specified modules. This involves
> use of possibly costly and or blocking operations and hence might
> have a performance impact.
The metrics were added by and show up in an OpenTelemetry trace
as:
```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
```
We already have `inference_time` and `decode_time` metrics, so the
question is whether there are sufficiently common use cases for the
higher-resolution timings to justify the overhead.
Since we are going to treat the question of OpenTelemetry support
separately, we will include these particular metrics under that topic.
---
# Multi-Modal Data Processing
To enable various optimizations in vLLM such as [chunked prefill](../configuration/optimization.md#chunked-prefill) and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of the HF processor.
Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
## Prompt Update Detection
One of the main responsibilities of HF processor is to update the prompt with placeholder tokens. For example:
- Insert feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size) at the start of the string.
- Replace existing input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size).
The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
In vLLM, this information is specified using [PromptUpdate][vllm.multimodal.processing.PromptUpdate] in [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates]. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
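As a simplified illustration of what such an update does (a toy helper, not the `PromptUpdate` API itself; the token ID is hypothetical):

```python
# Toy illustration: replace each input placeholder token with `feature_size`
# feature placeholder tokens. The real logic lives in PromptUpdate and
# _get_prompt_updates.
def expand_image_placeholders(
    token_ids: list[int], image_token_id: int, feature_size: int
) -> list[int]:
    out: list[int] = []
    for tok in token_ids:
        out.extend([image_token_id] * feature_size if tok == image_token_id else [tok])
    return out

# e.g. with feature_size=3: [1, 32000, 7] -> [1, 32000, 32000, 32000, 7]
assert expand_image_placeholders([1, 32000, 7], 32000, 3) == [1, 32000, 32000, 32000, 7]
```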
## Tokenized Prompt Inputs
To enable tokenization in a separate process, we support passing input token IDs alongside multi-modal data.
### The problem
Consider that HF processors follow these main steps:
1. Tokenize the text
2. Process multi-modal inputs
3. Perform prompt updates
And we require that:
- For text + multi-modal inputs, apply all steps 1--3.
- For tokenized + multi-modal inputs, apply only steps 2--3.
How can we achieve this without rewriting HF processors? We can try to call the HF processor several times on different inputs:
- For text + multi-modal inputs, simply call the HF processor directly.
- For tokenized + multi-modal inputs, call the processor only on the multi-modal inputs.
While HF processors support text + multi-modal inputs natively, this is not so for tokenized + multi-modal inputs: an error is thrown if the number of input placeholder tokens does not correspond to the number of multi-modal inputs.
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
### Dummy text
We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
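For many models the implementation can be as simple as repeating a placeholder token once per input item; for example (illustrative only, assuming a `<image>` placeholder token):

```python
# Illustrative get_dummy_text-style helper: one placeholder per multi-modal
# item, so the HF processor can be called without the user's real text.
def build_dummy_text(num_images: int, image_token: str = "<image>") -> str:
    return image_token * num_images

assert build_dummy_text(2) == "<image><image>"
```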
### Automatic prompt updating
We address the second issue by implementing model-agnostic code in
[_apply_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates] to automatically update the prompt with feature placeholder tokens based on the specification outputted by [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
### Summary
With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in [_apply_hf_processor_main][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main].
## Processor Output Caching
Some HF processors, such as the one for Qwen2-VL, are [very slow](https://github.com/vllm-project/vllm/issues/9238). To alleviate this problem, we cache the multi-modal outputs of HF processor to avoid processing the same multi-modal input (e.g. image) again.
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.
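A minimal sketch of this caching strategy (hypothetical helper names, not the actual cache implementation):

```python
# Check the cache, batch-process only the missing items, then return outputs
# for all items in their original order.
from typing import Any, Callable, Hashable, Sequence

def process_with_cache(
    items: Sequence[Hashable],
    cache: dict[Hashable, Any],
    hf_process_batch: Callable[[list[Hashable]], list[Any]],
) -> list[Any]:
    missing = [item for item in items if item not in cache]
    if missing:
        # Process the missing items in a single batch and cache the outputs.
        for item, output in zip(missing, hf_process_batch(missing)):
            cache[item] = output
    # Merge with the already-cached items, preserving the original order.
    return [cache[item] for item in items]
```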
---
# Fused MoE Kernel Features
The purpose of this document is to provide an overview of the various MoE kernels (both modular and non-modular) so it will be easier to select an appropriate set of kernels for any particular situation. This includes information about the all2all backends used by modular kernels.
## Fused MoE Modular All2All backends
There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` subclasses provide an interface for each all2all backend.
The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support.
The output activation format (standard or batched) corresponds to the output of the prepare step of the `FusedMoEPrepareAndFinalize` subclass, and the finalize step requires the same format. All the backend `prepare` methods expect activations in the standard format and all the `finalize` methods return activations in standard format. More details on the formats can be found in the [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) document.
The quantization types and formats enumerate which quantization schemes are supported by each `FusedMoEPrepareAndFinalize` class. The quantization can happen before or after the dispatch based on the format the all2all backend supports, e.g. deepep_high_throughput supports only block-quantized fp8 format. Any other format will result in dispatching in higher precision and quantizing afterwards. The output of the prepare step for each backend is the quantized type. The finalize step generally requires the same input type as the original activations, e.g. if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, `prepare` will return fp8/per-tensor scale activations and `finalize` will take bfloat16 activations. See the diagrams in [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) for more details on the types and formats of activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16.
Async backends support the use of DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step).
Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. Llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to deal with this flag.
Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP.
| Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Subclass |
|---------|--------------------|--------------|---------------|-------|-----------------------|-----------|
| naive | standard | all<sup>1</sup> | G,A,T | N | <sup>6</sup> | [layer.py][vllm.model_executor.layers.fused_moe.layer.FusedMoE.forward_impl] |
| pplx | batched | fp8,int8 | G,A,T | Y | Y | [`PplxPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.pplx_prepare_finalize.PplxPrepareAndFinalize] |
| deepep_high_throughput | standard | fp8 | G(128),A,T<sup>2</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] |
| deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize] |
| flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferAllToAllMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferAllToAllMoEPrepareAndFinalize] |
| flashinfer<sup>4</sup> | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize] |
| MoEPrepareAndFinalizeNoEP<sup>5</sup> | standard | fp8,int8 | G,A,T | N | Y | [`MoEPrepareAndFinalizeNoEP`][vllm.model_executor.layers.fused_moe.prepare_finalize.MoEPrepareAndFinalizeNoEP] |
| BatchedPrepareAndFinalize<sup>5</sup> | batched | fp8,int8 | G,A,T | N | Y | [`BatchedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedPrepareAndFinalize] |
!!! info "Table key"
1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A,T quantization occurs after dispatch.
3. All quantization happens after dispatch.
4. Controlled by different env vars (`VLLM_FLASHINFER_MOE_BACKEND` "throughput" or "latency")
5. This is a no-op dispatcher that can be used to pair with any modular experts to produce a modular kernel that runs without dispatch or combine. These cannot be selected via environment variable. These are generally used for testing or adapting an expert subclass to the `fused_experts` API.
6. This depends on the experts implementation.
---
- G - Grouped
- G(N) - Grouped w/block size N
- A - Per activation token
- T - Per tensor
Modular kernels are supported by the following `FusedMoEMethodBase` classes.
- [`ModelOptFp8MoEMethod`][vllm.model_executor.layers.quantization.modelopt.ModelOptFp8MoEMethod]
- [`Fp8MoEMethod`][vllm.model_executor.layers.quantization.fp8.Fp8MoEMethod]
- [`CompressedTensorsW4A4Nvfp4MoEMethod`][vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe.CompressedTensorsW4A4Nvfp4MoEMethod]
- [`CompressedTensorsW8A8Fp8MoEMethod`][vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe.CompressedTensorsW8A8Fp8MoEMethod]
- [`Mxfp4MoEMethod`][vllm.model_executor.layers.quantization.mxfp4.Mxfp4MoEMethod]
- [`UnquantizedFusedMoEMethod`][vllm.model_executor.layers.fused_moe.layer.UnquantizedFusedMoEMethod]
## Fused Experts Kernels
There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adapters, so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.
Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. `pplx` and `DeepEPLLPrepareAndFinalize`.
Similar to the backend kernels, each experts kernel only supports certain quantization formats. For non-modular experts, the activations will be in the original type and quantized internally by the kernel. Modular experts will expect the activations to already be in the quantized format. Both types of experts will yield outputs in the original activation type.
Each experts kernel supports one or more activation functions, e.g. silu or gelu, which are applied to the intermediate results.
As with the backends, some experts support applying topk weights on the input activations. The entries in the column in this table only apply to the non-modular experts.
Most experts flavors include an equivalent modular interface which will be a subclass of `FusedMoEPermuteExpertsUnpermute`.
To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels must have compatible activation formats, quantization types and quantization formats.
| Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
|--------|-------------------|--------------|---------------|---------------------|-----------------------|---------|--------|
| triton | standard | all<sup>1</sup> | G,A,T | silu, gelu, swigluoai, silu_no_mul, gelu_no_mul | Y | Y | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts],[`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts] |
| triton (batched) | batched | all<sup>1</sup> | G,A,T | silu, gelu | <sup>6</sup> | Y | [`BatchedTritonExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedTritonExperts] |
| deep gemm | standard,batched | fp8 | G(128),A,T | silu, gelu | <sup>6</sup> | Y | [`deep_gemm_moe_fp8`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.deep_gemm_moe_fp8],[`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts],[`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts] |
| cutlass_fp4 | standard,batched | nvfp4 | A,T | silu | Y | Y | [`cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp4],[`CutlassExpertsFp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp4] |
| cutlass_fp8 | standard,batched | fp8 | A,T | silu, gelu | Y | Y | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8],[`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],[`CutlassBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8] |
| flashinfer | standard | nvfp4,fp8 | T | <sup>5</sup> | N | Y | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4],[`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts] |
| gpt oss triton | standard | N/A | N/A | <sup>5</sup> | Y | Y | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],[`OAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.OAITritonExperts] |
| marlin | standard,batched | <sup>3</sup> / N/A | <sup>3</sup> / N/A | silu, swigluoai | Y | Y | [`fused_marlin_moe`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.fused_marlin_moe],[`MarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.MarlinExperts],[`BatchedMarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.BatchedMarlinExperts] |
| trtllm | standard | mxfp4,nvfp4 | G(16),G(32) | <sup>5</sup> | N | Y | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts] |
| pallas | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_pallas.fused_moe] |
| iterative | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_torch_iterative.fused_moe] |
| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_experts] |
| cpu_fused_moe | standard | N/A | N/A | silu | N | N | [`CPUFusedMOE`][vllm.model_executor.layers.fused_moe.cpu_fused_moe.CPUFusedMOE] |
| naive batched<sup>4</sup> | batched | int8,fp8 | G,A,T | silu, gelu | <sup>6</sup> | Y | [`NaiveBatchedExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.NaiveBatchedExperts] |
!!! info "Table key"
1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A dispatcher wrapper around triton and deep gemm experts. Will select based on type + shape + quantization params
3. uint4, uint8, fp8, fp4
4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
5. The `activation` parameter is ignored and SwiGlu is used by default instead.
6. Only handled by or supported when used with modular kernels.
## Modular Kernel "families"
The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.
| backend | `FusedMoEPrepareAndFinalize` subclasses | `FusedMoEPermuteExpertsUnpermute` subclasses |
|---------|-----------------------------------------|----------------------------------------------|
| deepep_high_throughput | `DeepEPHTPrepareAndFinalize` | `DeepGemmExperts`,`TritonExperts`,`TritonOrDeepGemmExperts`,`CutlassExpertsFp8`, `MarlinExperts` |
| deepep_low_latency,pplx | `DeepEPLLPrepareAndFinalize`,`PplxPrepareAndFinalize` | `BatchedDeepGemmExperts`,`BatchedTritonExperts`,`CutlassBatchedExpertsFp8`,`BatchedMarlinExperts` |
| flashinfer | `FlashInferCutlassMoEPrepareAndFinalize` | `FlashInferExperts` |
---
# Python Multiprocessing
## Debugging
Please see the [Troubleshooting](../usage/troubleshooting.md#python-multiprocessing)
page for information on known issues and how to solve them.
## Introduction
!!! important
The source code references are to the state of the code at the time of writing in December 2024.
The use of Python multiprocessing in vLLM is complicated by:
- The use of vLLM as a library and the inability to control the code using vLLM
- Varying levels of incompatibilities between multiprocessing methods and vLLM
dependencies
This document describes how vLLM deals with these challenges.
## Multiprocessing Methods
[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:
- `spawn` - spawn a new Python process. The default on Windows and macOS.
- `fork` - Use `os.fork()` to fork the Python interpreter. The default on
Linux for Python versions prior to 3.14.
- `forkserver` - Spawn a server process that will fork a new process on request.
The default on Linux for Python version 3.14 and newer.
### Tradeoffs
`fork` is the fastest method, but is incompatible with dependencies that use
threads. On macOS, using `fork` may cause the process to crash.
`spawn` is more compatible with dependencies, but can be problematic when vLLM
is used as a library. If the consuming code does not use a `__main__` guard (`if
__name__ == "__main__":`), the code will be inadvertently re-executed when vLLM
spawns a new process. This can lead to infinite recursion, among other problems.
`forkserver` will spawn a new server process that will fork new processes on
demand. This unfortunately has the same problem as `spawn` when vLLM is used as
a library. The server process is created as a spawned new process, which will
re-execute code not protected by a `__main__` guard.
For both `spawn` and `forkserver`, the process must not depend on inheriting any
global state as would be the case with `fork`.
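For example, a script that uses vLLM as a library can be structured like this so that it remains safe under `spawn` or `forkserver` (the model name is just a placeholder):

```python
# The __main__ guard prevents this module's top-level code from re-running
# when vLLM (or Python multiprocessing) spawns a new process.
from vllm import LLM

def main() -> None:
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate("Hello, my name is")
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()
```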
## Compatibility with Dependencies
Multiple vLLM dependencies indicate either a preference or requirement for using
`spawn`:
-
-
-
It is perhaps more accurate to say that there are known problems with using
`fork` after initializing these dependencies.
## Current State (v0)
The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control which method is used by vLLM. The current default is `fork`.
-
When we know we own the process because the `vllm` command was used, we use
`spawn` because it's the most widely compatible.
-
The `multiproc_xpu_executor` forces the use of `spawn`.
-
There are other miscellaneous places hard-coding the use of `spawn`:
-
-
Related PRs:
-
## Prior State in v1
There was an environment variable to control whether multiprocessing is used in
the v1 engine core, `VLLM_ENABLE_V1_MULTIPROCESSING`. This defaulted to off.
-
When it was enabled, the v1 `LLMEngine` would create a new process to run the
engine core.
-
-
-
It was off by default for all the reasons mentioned above - compatibility with
dependencies and code using vLLM as a library.
### Changes Made in v1
There is not an easy solution with Python's `multiprocessing` that will work
everywhere. As a first step, we can get v1 into a state where it does "best
effort" choice of multiprocessing method to maximize compatibility.
- Default to `fork`.
- Use `spawn` when we know we control the main process (`vllm` was executed).
- If we detect `cuda` was previously initialized, force `spawn` and emit a
warning. We know `fork` will break, so this is the best we can do.
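A minimal sketch of this decision logic (simplified and hypothetical; not the exact code in vLLM):

```python
import os
import torch

def choose_worker_mp_method(launched_via_vllm_cli: bool) -> str:
    # Explicit user configuration always wins.
    explicit = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD")
    if explicit:
        return explicit
    if launched_via_vllm_cli:
        # We own the main process, so spawn is safe and most compatible.
        return "spawn"
    if torch.cuda.is_initialized():
        # fork after CUDA initialization is known to break; warn and use spawn.
        print("WARNING: CUDA was already initialized; forcing 'spawn'.")
        return "spawn"
    return "fork"
```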
The case that is known to still break in this scenario is code using vLLM as a
library that initializes `cuda` before calling vLLM. The warning we emit should
instruct users to either add a `__main__` guard or to disable multiprocessing.
If that known-failure case occurs, the user will see two messages that explain
what is happening. First, a log message from vLLM:
```console
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
for more information.
```
Second, Python itself will raise an exception with a nice explanation:
```console
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html
```
## Alternatives Considered
### Detect if a `__main__` guard is present
It has been suggested that we could behave better if we could detect whether
code using vLLM as a library has a `__main__` guard in place. This [post on
stackoverflow](https://stackoverflow.com/questions/77220442/multiprocessing-pool-in-a-python-class-without-name-main-guard)
was from a library author facing the same question.
It is possible to detect whether we are in the original, `__main__` process, or
a subsequent spawned process. However, it does not appear to be straightforward
to detect whether a `__main__` guard is present in the code.
This option has been discarded as impractical.
### Use `forkserver`
At first it appears that `forkserver` is a nice solution to the problem.
However, the way it works presents the same challenges that `spawn` does when
vLLM is used as a library.
### Force `spawn` all the time
One way to clean this up is to just force the use of `spawn` all the time and
document that the use of a `__main__` guard is required when using vLLM as a
library. This would unfortunately break existing code and make vLLM harder to
use, violating the desire to make the `LLM` class as easy as possible to use.
Instead of pushing this on our users, we will retain the complexity to do our
best to make things work.
## Future Work
We may want to consider a different worker management approach in the future
that works around these challenges.
1. We could implement something `forkserver`-like, but have the process manager
be something we launch ourselves by running our own subprocess with a custom
entrypoint for worker management (e.g. launch a `vllm-manager` process).
2. We can explore other libraries that may better suit our needs. Examples to
consider:
-
---
# Optimization Levels
## Overview
vLLM now supports optimization levels (`-O0`, `-O1`, `-O2`, `-O3`). Optimization levels provide an intuitive mechanism for users to trade startup time for performance. Higher levels have better performance but worse startup time. These optimization levels have associated defaults to help users get desired out-of-the-box performance. Importantly, defaults set by optimization levels are purely defaults; explicit user settings will not be overwritten.
## Level Summaries and Usage Examples
#### `-O0`: No Optimizations
- **Startup**: Fastest startup time
- **Performance**: No compilation optimizations
- **Use case**: Debugging and quick iteration
```bash
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O0
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=0
)
```
#### `-O1`: Quick Optimizations
- **Startup**: Moderate startup time
- **Performance**: Inductor compilation, CUDAGraphMode.PIECEWISE
- **Use case**: Balance for most development scenarios
```bash
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=1
)
```
#### `-O2`: Full Optimizations (Default)
- **Startup**: Longer startup time
- **Performance**: `-O1` + CUDAGraphMode.FULL_AND_PIECEWISE
- **Use case**: Production workloads where performance is important. This is the default use case. It is also very similar to the previous default. The primary difference is that noop & fusion flags are enabled.
```bash
# CLI usage (default, so optional)
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O2
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=2 # This is the default
)
```
#### `-O3`: Full Optimization
Still in development. The infrastructure was added now to avoid changing the API
in a future release. Currently behaves the same as `-O2`.
## Troubleshooting
### Common Issues
1. **Startup Time Too Long**: Use `-O0` or `-O1` for faster startup
2. **Compilation Errors**: Use `debug_dump_path` for additional debugging information
3. **Performance Issues**: Ensure using `-O2` for production
---
# P2P NCCL Connector
An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
## Detailed Design
### Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:
1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
2. The Proxy/Router selects a **1P1D (1 Prefill instance + 1 Decode instance)** through either round-robin or random selection, generates a `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to **1**, and then forwards the request to the **P instance**.
3. Immediately afterward, the Proxy/Router forwards the **original HTTP request** to the **D instance**.
4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT_ASYNC** mode). The D instance's `zmq_addr` can be resolved through the `request_id`.
5. The **D instance** has a **dedicated thread** for receiving the KV cache (to avoid blocking the main process). The received KV cache is saved into the **GPU memory buffer**, the size of which is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the **local Tensor memory pool**.
6. During the **Decode**, the D instance's main process retrieves the KV cache (transmitted by the P instance) from either the **GPU buffer** or the **memory pool**, thereby **skipping Prefill**.
7. After completing **Decode**, the D instance returns the result to the **Proxy/Router**, which then forwards it to the **client**.

### Proxy/Router (Demo)
A simple HTTP service acts as the entry point for client requests and starts a background thread to listen for P/D instances reporting their HTTP IP and PORT, as well as ZMQ IP and PORT. It maintains a dictionary of `http_addr -> zmq_addr`. The `http_addr` is the IP:PORT for the vLLM instance's request, while the `zmq_addr` is the address for KV cache handshake and metadata reception.
The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example:
```text
cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0
```
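For illustration, the embedded addresses can be recovered from such a `request_id` with a simple pattern (the regex below is illustrative, not the connector's actual parsing code):

```python
import re

request_id = (
    "cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_"
    "93923d63113b4b338973f24d19d4bf11-0"
)
match = re.search(
    r"___prefill_addr_(?P<prefill>[\d.:]+)___decode_addr_(?P<decode>[\d.:]+)_",
    request_id,
)
assert match is not None
print(match["prefill"], match["decode"])  # 10.0.1.2:21001 10.0.1.3:22001
```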
Currently, to quickly verify whether xPyD can work, a round-robin selection of 1P1D is used. In the future, it is planned to use a trie combined with the load status of instances to select appropriate P and D.
Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (currently every 3 seconds) to register (i.e., report `http_addr -> zmq_addr`) and keep the connection alive. If an instance crashes and fails to send a ping for a certain period of time, the Proxy/Router will remove the timed-out instance (this feature has not yet been developed).
### KV Cache Transfer Methods
There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache.
Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT.
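For example, a producer-side `--kv-transfer-config` value selecting `PUT_ASYNC` could be built like this (field values mirror the commands later in this document; treat this as a sketch rather than a canonical configuration):

```python
import json

kv_transfer_config = {
    "kv_connector": "P2pNcclConnector",
    "kv_role": "kv_producer",
    "kv_buffer_size": "1e1",
    "kv_port": "21001",
    "kv_connector_extra_config": {
        "send_type": "PUT_ASYNC",  # PUT, GET, or PUT_ASYNC
        "proxy_ip": "10.0.1.1",
        "proxy_port": "30001",
        "http_port": "20001",
    },
}
# Pass the JSON string as the value of --kv-transfer-config.
print(json.dumps(kv_transfer_config))
```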
### P2P Communication via ZMQ & NCCL
As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank or world size. This supports dynamic scaling (expansion and contraction) of instances with PD disaggregation, meaning that adding or removing P/D instances does not require a full system restart.
Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVCache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVCache data itself.
When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size.
### NCCL Group Topology
Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance.

Each NCCL group occupies a certain amount of GPU memory buffer for communication, the size of which is primarily influenced by the `NCCL_MAX_NCHANNELS` environment variable. When `NCCL_MAX_NCHANNELS=16`, an NCCL group typically occupies 100MB, while when `NCCL_MAX_NCHANNELS=8`, it usually takes up 52MB. For large-scale xPyD configurations—such as DeepSeek's 96P144D—this implementation is currently not feasible. Moving forward, we are considering using RDMA for point-to-point communication and are also keeping an eye on UCCL.
### GPU Memory Buffer and Tensor Memory Pool
The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. If it is too large, it will reduce the KVCache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size.
If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVCache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVCache loss. Once KVCache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance.
To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVCache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVCache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store.
## Install vLLM
```shell
pip install "vllm>=0.9.2"
```
## Run xPyD
### Instructions
- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.
- You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict).
- `PUT_ASYNC` offers the best performance and should be prioritized.
- The `--port` must be consistent with the `http_port` in the `--kv-transfer-config`.
- The `disagg_proxy_p2p_nccl_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances).
- The node running the proxy must have `quart` installed.
- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
### Run 1P3D
#### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```
#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20001 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```
#### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20002 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```
#### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20003 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```
#### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20004 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
### Run 3P1D
#### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```
#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20001 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```
#### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20002 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```
#### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20003 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```
#### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
??? console "Command"
```shell
CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \
--port 20004 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name base_model \
--dtype float16 \
--max-model-len 10000 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
## Single request
```shell
curl -X POST -s http://10.0.1.1:10001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "base_model",
"prompt": "San Francisco is a",
"max_tokens": 10,
"temperature": 0
}'
```
## Benchmark
??? console "Command"
```shell
vllm bench serve \
--backend vllm \
--model base_model \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--dataset-name "random" \
--host 10.0.1.1 \
--port 10001 \
--random-input-len 1024 \
--random-output-len 1024 \
--ignore-eos \
--burstiness 100 \
--percentile-metrics "ttft,tpot,itl,e2el" \
--metric-percentiles "90,95,99" \
--seed $(date +%s) \
--trust-remote-code \
--request-rate 3 \
--num-prompts 1000
```
## Shut down
```shell
pgrep python | xargs kill -9 && pkill -f python
```
## Test data
### **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s

---
# Paged Attention
!!! warning
This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
It no longer describes the code used in vLLM today.
Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
This kernel is designed to be compatible with
vLLM's paged KV caches, where the key and value cache are stored in
separate blocks (note that this block concept differs from the GPU
thread block; in the rest of this document, I will refer to the vLLM paged
attention block as "block" and to the GPU thread block as
"thread block").
To achieve high performance, this kernel relies on a specially
designed memory layout and access method, specifically when threads
read data from global memory to shared memory. The purpose of this
document is to provide a high-level explanation of the kernel
implementation step by step, aiding those who wish to learn about the
vLLM multi-head query attention kernel. After going through this
document, users will likely have a better understanding and find it easier
to follow the actual implementation.
Please note that this document may not cover all details, such as how
to calculate the correct index for the corresponding data or the dot
multiplication implementation. However, after reading this document
and becoming familiar with the high-level logic flow, it should be
easier for you to read the actual code and understand the details.
## Inputs
The kernel function takes a list of arguments for the current thread
to perform its assigned work. The three most important arguments are
the input pointers `q`, `k_cache`, and `v_cache`, which point
to query, key, and value data on global memory that need to be read
and processed. The output pointer `out` points to global memory
where the result should be written. These four pointers actually
refer to multidimensional arrays, but each thread only accesses the
portion of data assigned to it. I have omitted all other runtime
parameters here for simplicity.
```cpp
template <typename scalar_t, int HEAD_SIZE, int BLOCK_SIZE, int NUM_THREADS, int PARTITION_SIZE = 0>
__device__ void paged_attention_kernel(
... // Other side args.
scalar_t* __restrict__ out, // [num_seqs, num_heads, max_num_partitions, head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
... // Other side args.
)
```
There is also a list of template arguments above the function
signature that are determined during compilation time. `scalar_t`
represents the data type of the query, key, and value data elements,
such as FP16. `HEAD_SIZE` indicates the number of elements in each
head. `BLOCK_SIZE` refers to the number of tokens in each block.
`NUM_THREADS` denotes the number of threads in each thread block.
`PARTITION_SIZE` represents the number of tensor parallel GPUs (For
simplicity, we assume this is 0 and tensor parallel is disabled).
With these arguments, we need to perform a sequence of preparations.
This includes calculating the current head index, block index, and
other necessary variables. However, for now, we can ignore these
preparations and proceed directly to the actual calculations. It will
be easier to understand them once we grasp the entire flow.
## Concepts
Just before we dive into the calculation flow, I want to describe a
few concepts that are needed for later sections. However, you may
skip this section and return later if you encounter any confusing
terminologies.
- **Sequence**: A sequence represents a client request. For example,
the data pointed to by `q` has a shape of
`[num_seqs, num_heads, head_size]`. That means there are a total of
`num_seqs` query sequences pointed to by `q`. Since this
kernel is a single-query attention kernel, each sequence only has one
query token. Hence, `num_seqs` equals the total number of tokens
that are processed in the batch.
- **Context**: The context consists of the generated tokens from the
sequence. For instance, `["What", "is", "your"]` are the context
tokens, and the input query token is `"name"`. The model might
generate the token `"?"`.
- **Vec**: The vec is a list of elements that are fetched and
calculated together. For query and key data, the vec size
(`VEC_SIZE`) is determined so that each thread group can fetch and
calculate 16 bytes of data at a time. For value data, the vec size
(`V_VEC_SIZE`) is determined so that each thread can fetch and
calculate 16 bytes of data at a time. For example, if the
`scalar_t` is FP16 (2 bytes) and `THREAD_GROUP_SIZE` is 2, the
`VEC_SIZE` will be 4, while the `V_VEC_SIZE` will be 8 (see the worked example after this list).
- **Thread group**: The thread group is a small group of
threads(`THREAD_GROUP_SIZE`) that fetches and calculates one
query token and one key token at a time. Each thread handles only a
portion of the token data. The total number of elements processed by
one thread group is referred to as `x`. For example, if the thread
group contains 2 threads and the head size is 8, then thread 0
handles the query and key elements at index 0, 2, 4, 6, while thread
1 handles the elements at index 1, 3, 5, 7.
- **Block**: The key and value cache data in vLLM are split into
blocks. Each block stores data for a fixed number(`BLOCK_SIZE`)
of tokens at one head. Each block may contain only a portion of the
whole context tokens. For example, if the block size is 16 and the
head size is 128, then for one head, one block can store 16 * 128 =
2048 elements.
- **Warp**: A warp is a group of 32 threads(`WARP_SIZE`) that
execute simultaneously on a streaming multiprocessor (SM). In this
kernel, each warp processes the calculation between one query token
and key tokens of one entire block at a time (it may process multiple
blocks in multiple iterations). For example, if there are 4 warps and
6 blocks for one context, the assignment would be like warp 0 handles
the 0th, 4th blocks, warp 1 handles the 1st, 5th blocks, warp 2
handles the 2nd block and warp 3 handles the 3rd block.
- **Thread block**: A thread block is a group of
threads(`NUM_THREADS`) that can access the same shared memory.
Each thread block contains multiple warps(`NUM_WARPS`), and in
this kernel, each thread block processes the calculation between one
query token and key tokens of a whole context.
- **Grid**: A grid is a collection of thread blocks and defines the
shape of the collection. In this kernel, the shape is
`(num_heads, num_seqs, max_num_partitions)`. Therefore, each thread
block only handles the calculation for one head, one sequence, and
one partition.
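The vec sizes from the example in the "Vec" item above can be computed as follows (a small worked example, not code from the kernel):

```python
# Worked example for FP16 elements (2 bytes) and THREAD_GROUP_SIZE == 2:
# each thread group fetches 16 bytes of query/key data per step, and each
# thread fetches 16 bytes of value data per step.
DTYPE_SIZE = 2        # sizeof(scalar_t) for FP16
THREAD_GROUP_SIZE = 2

VEC_SIZE = 16 // (THREAD_GROUP_SIZE * DTYPE_SIZE)  # -> 4
V_VEC_SIZE = 16 // DTYPE_SIZE                      # -> 8
print(VEC_SIZE, V_VEC_SIZE)
```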
## Query
This section will introduce how query data is stored in memory and
fetched by each thread. As mentioned above, each thread group fetches
one query token data, while each thread itself only handles a part of
one query token data. Within each warp, every thread group will fetch
the same query token data, but will multiply it with different key
token data.
```cpp
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```
Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
a total of 128 elements, divided into 128 / 4 = 32 vecs.
```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
```
Next, we need to read the global memory data pointed to by `q_ptr`
into shared memory as `q_vecs`. It is important to note that each
thread's vecs are assigned to a different row. For example, if the
`THREAD_GROUP_SIZE` is 2, thread 0 will handle the 0th row of vecs,
while thread 1 handles the 1st row. By reading the query data in
this way, neighboring threads like thread 0 and thread 1 can read
neighbor memory, achieving the memory coalescing to improve
performance.
## Key
Similar to the "Query" section, this section introduces memory layout
and assignment for keys. While each thread group only handles one
query token in one kernel run, it may handle multiple key tokens across
multiple iterations. Meanwhile, each warp will process multiple blocks
of key tokens in multiple iterations, ensuring that all context
tokens are processed by the entire thread group after the kernel run.
In this context, "handle" refers to performing the dot multiplication
between query data and key data.
```cpp
const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
+ kv_head_idx * kv_head_stride
+ physical_block_offset * x;
```
Unlike `q_ptr`, `k_ptr` in each thread points to a different
key token at each iteration. As shown above, `k_ptr`
points to key token data based on `k_cache` at the assigned block,
assigned head, and assigned token.
The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
8, `THREAD_GROUP_SIZE` is 2, and there are a total of 4 warps. Each
rectangle represents all the elements for one key token at one head,
which will be processed by one thread group. The left half shows the
total 16 blocks of key token data for warp 0, while the right half
represents the remaining key token data for other warps or
iterations. Inside each rectangle, there are a total 32 vecs (128
elements for one token) that will be processed by 2 threads (one
thread group) separately.
```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD];
```
Next, we need to read the key token data from `k_ptr` and store
it in register memory as `k_vecs`. We use register memory for
`k_vecs` because it will only be accessed by one thread once,
whereas `q_vecs` will be accessed by multiple threads multiple
times. Each `k_vecs` will contain multiple vectors for later
calculation. Each vec will be set at each inner iteration. The
assignment of vecs allows neighboring threads in a warp to read
neighboring memory together, which again promotes memory
coalescing. For instance, thread 0 will read vec 0, while thread 1
will read vec 1. In the next inner loop, thread 0 will read vec 2,
while thread 1 will read vec 3, and so on.
You may still be a little confused about the overall flow. Don't
worry, please keep reading the next "QK" section. It will illustrate
the query and key calculation flow in a clearer and higher-level
manner.
## QK
As shown in the pseudocode below, before the entire for-loop block, we
fetch the query data for one token and store it in `q_vecs`. Then,
in the outer for loop, we iterate through different `k_ptrs` that
point to different tokens and prepare the `k_vecs` in the inner for
loop. Finally, we perform the dot multiplication between the
`q_vecs` and each `k_vecs`.
```cpp
q_vecs = ...
for ... {
k_ptr = ...
for ... {
k_vecs[i] = ...
}
...
float qk = scale * Qk_dot::dot(q_vecs[thread_group_offset], k_vecs);
}
```
As mentioned before, each thread only fetches part of the query and
key token data at a time. However, a cross-thread-group reduction
happens inside `Qk_dot<>::dot`, so the `qk` returned here is not just
the partial dot product between parts of the query and key tokens,
but the full result over the entire query and key token data.
For example, if the value of `HEAD_SIZE` is 128 and
`THREAD_GROUP_SIZE` is 2, each thread's `k_vecs` will contain a
total of 64 elements. However, the returned `qk` is actually the
result of the dot multiplication between all 128 query elements and
128 key elements. If you want to learn more about the details of the
dot multiplication and reduction, you may refer to the implementation
of `Qk_dot<>::dot`. However, for the sake of simplicity, I will not
cover it in this document.
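To make the reduction concrete, here is a small NumPy sketch (an illustration of the math only, assuming `HEAD_SIZE` is 128 and `THREAD_GROUP_SIZE` is 2; the exact element partition in the kernel is vec-interleaved, but any partition of the head dimension gives the same sum):
```python
import numpy as np

HEAD_SIZE = 128
THREAD_GROUP_SIZE = 2

rng = np.random.default_rng(0)
q = rng.standard_normal(HEAD_SIZE).astype(np.float32)
k = rng.standard_normal(HEAD_SIZE).astype(np.float32)

# Each thread owns an interleaved slice of the head dimension and computes
# a partial dot product over its 64 elements.
partials = []
for t in range(THREAD_GROUP_SIZE):
    idx = np.arange(t, HEAD_SIZE, THREAD_GROUP_SIZE)
    partials.append(float(q[idx] @ k[idx]))

# The cross-thread-group reduction in Qk_dot<>::dot sums the partials,
# so every thread ends up with the full 128-element dot product.
assert np.isclose(sum(partials), float(q @ k), rtol=1e-5)
```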
## Softmax
Next, we need to calculate the normalized softmax for all `qk`s,
as shown below, where each $x$ represents a `qk`. To do this,
we must obtain the reduced value of `qk_max` ($m(x)$) and
the `exp_sum` ($\ell(x)$) of all `qk`s. The reduction
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.
$$
\begin{gather*}
m(x) := \max_i x_i \\
f(x) := \left[\, e^{x_1 - m(x)} \;\; \ldots \;\; e^{x_B - m(x)} \,\right] \\
\ell(x) := \sum_i f(x)_i \\
\operatorname{softmax}(x) := \frac{f(x)}{\ell(x)}
\end{gather*}
$$
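In NumPy terms, the safe softmax above can be sketched as follows (an illustration of the math only, not kernel code):
```python
import numpy as np

def safe_softmax(x: np.ndarray) -> np.ndarray:
    m = np.max(x)            # m(x): the reduced qk_max
    f = np.exp(x - m)        # f(x): shifted exponentials
    return f / np.sum(f)     # divide by l(x), the exp_sum

qk = np.array([1.5, -0.3, 4.2, 0.0], dtype=np.float32)
print(safe_softmax(qk))      # sums to 1 and stays stable for large qk values
```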
### `qk_max` and `logits`
Right after we get the `qk` result, we can set the temporary
`logits` value to `qk` (in the end, `logits` should store the
normalized softmax results). We can also compare and collect the
`qk_max` across all the `qk`s calculated by the current thread
group.
```cpp
if (thread_group_offset == 0) {
const bool mask = token_idx >= context_len;
logits[token_idx - start_token_idx] = mask ? 0.f : qk;
qk_max = mask ? qk_max : fmaxf(qk_max, qk);
}
```
Please note that `logits` here is in shared memory, so each
thread group will set the fields for its own assigned context tokens.
Overall, the size of `logits` should be the number of context tokens.
```cpp
for (int mask = WARP_SIZE / 2; mask >= THREAD_GROUP_SIZE; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
if (lane == 0) {
red_smem[warp_idx] = qk_max;
}
```
Then we need to get the reduced `qk_max` within each warp. The main
idea is to let the threads in a warp communicate with each other and
obtain the final max `qk`.
```cpp
for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
qk_max = VLLM_SHFL_SYNC(qk_max, 0);
```
Finally, we can get the reduced `qk_max` for the whole thread block
by comparing the `qk_max` from all warps in this thread block. Then
we need to broadcast the final result to each thread.
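The shuffle-based reduction can be simulated on the host with a few lines of Python (a simplified sketch; in the kernel, `VLLM_SHFL_XOR_SYNC` exchanges values between lanes whose indices differ by `mask`):
```python
def butterfly_max(vals: list[float]) -> list[float]:
    # Simulate an XOR-shuffle max reduction across len(vals) lanes
    # (the lane count must be a power of two).
    width = len(vals)
    mask = width // 2
    while mask >= 1:
        vals = [max(vals[lane], vals[lane ^ mask]) for lane in range(width)]
        mask //= 2
    return vals  # every lane now holds the maximum

print(butterfly_max([3.0, -1.0, 7.5, 0.25]))  # [7.5, 7.5, 7.5, 7.5]
```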
### `exp_sum`
Similar to `qk_max`, we need to get the reduced sum value from the
entire thread block too.
```cpp
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
float val = __expf(logits[i] - qk_max);
logits[i] = val;
exp_sum += val;
}
...
exp_sum = block_sum(&red_smem[NUM_WARPS], exp_sum);
```
First, sum all the exp values from each thread and, meanwhile,
convert each entry of `logits` from `qk` to `exp(qk - qk_max)`.
Please note that the `qk_max` here is already the max `qk` across the
whole thread block. Then we can perform a reduction for `exp_sum`
across the whole thread block, just like we did for `qk_max`.
```cpp
const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f);
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
logits[i] *= inv_sum;
}
```
Finally, with the reduced `qk_max` and `exp_sum`, we can obtain
the final normalized softmax result as `logits`. This `logits`
variable will be used for dot multiplication with the value data in
later steps. Now, it should store the normalized softmax result of
`qk` for all assigned context tokens.
## Value
Now we need to retrieve the value data and perform the dot
multiplication with `logits`. Unlike query and key, there is no
thread group concept for value data. As shown in the diagram, unlike
the key token memory layout, elements from the same column correspond
to the same value token. One block of value data has `HEAD_SIZE`
rows and `BLOCK_SIZE` columns that are split into multiple
`v_vecs`.
Each thread always fetches `V_VEC_SIZE` elements from `V_VEC_SIZE`
different tokens at a time. As a result, a single thread retrieves
multiple `v_vec`s from different rows and the same columns through
multiple inner iterations. For each `v_vec`, it needs to be dot
multiplied with the corresponding `logits_vec`, which is also
`V_VEC_SIZE` elements from `logits`. Overall, with multiple inner
iterations, each warp will process one block of value tokens. And
with multiple outer iterations, the whole context's value tokens are
processed.
```cpp
float accs[NUM_ROWS_PER_THREAD];
for ... { // Iteration over different blocks.
logits_vec = ...
for ... { // Iteration over different rows.
v_vec = ...
...
accs[i] += dot(logits_vec, v_vec);
}
}
```
As shown in the above pseudocode, in the outer loop, similar to
`k_ptr`, `logits_vec` iterates over different blocks and reads
`V_VEC_SIZE` elements from `logits`. In the inner loop, each
thread reads `V_VEC_SIZE` elements from the same tokens as a
`v_vec` and performs dot multiplication. It is important to note
that in each inner iteration, the thread fetches different head
position elements for the same tokens. The dot result is then
accumulated in `accs`. Therefore, each entry of `accs` is mapped
to a head position assigned to the current thread.
For example, if `BLOCK_SIZE` is 16 and `V_VEC_SIZE` is 8, each
thread fetches 8 value elements for 8 tokens at a time. Each element
is from a different token at the same head position. If `HEAD_SIZE`
is 128 and `WARP_SIZE` is 32, for each inner loop, a warp needs to
fetch `WARP_SIZE * V_VEC_SIZE = 256` elements. This means there are
a total of 128 * 16 / 256 = 8 inner iterations for a warp to handle
a whole block of value tokens. And each `accs` in each thread
contains 8 elements that are accumulated at 8 different head
positions. For thread 0, the `accs` variable will have 8 elements,
which are the 0th, 16th, …, 112th elements of a value head, each
accumulated from all 8 assigned tokens.
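The head positions owned by a thread can be reproduced from the constants above and the `row_idx` formula used in the "Output" section below (a sketch; `NUM_ROWS_PER_ITER` is assumed to be `WARP_SIZE / NUM_V_VECS_PER_ROW`):
```python
# Sketch of the accs -> head-position mapping, using the example values above.
BLOCK_SIZE = 16
V_VEC_SIZE = 8
WARP_SIZE = 32
HEAD_SIZE = 128

NUM_V_VECS_PER_ROW = BLOCK_SIZE // V_VEC_SIZE         # 2 threads share one row
NUM_ROWS_PER_ITER = WARP_SIZE // NUM_V_VECS_PER_ROW   # 16 rows per inner iteration
NUM_ROWS_PER_THREAD = HEAD_SIZE // NUM_ROWS_PER_ITER  # 8 accs entries per thread

lane = 0  # thread 0
rows = [lane // NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER
        for i in range(NUM_ROWS_PER_THREAD)]
print(rows)  # head positions accumulated by thread 0: [0, 16, ..., 112]
```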
## LV
Now, we need to perform a reduction on `accs` within each warp. This
process allows each thread to accumulate the `accs` for its assigned
head positions over all tokens in one block.
```cpp
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
float acc = accs[i];
for (int mask = NUM_V_VECS_PER_ROW / 2; mask >= 1; mask /= 2) {
acc += VLLM_SHFL_XOR_SYNC(acc, mask);
}
accs[i] = acc;
}
```
Next, we perform a reduction on `accs` across all warps, allowing
each thread to hold the accumulation of `accs` for its assigned
head positions over all context tokens. Please note that each `accs`
in every thread only stores the accumulation for a portion of the
elements of the entire head for all context tokens. However, overall,
all the results for the output have been calculated; they are just
stored in different threads' register memory.
??? code
```cpp
float* out_smem = reinterpret_cast<float*>(shared_mem);
for (int i = NUM_WARPS; i > 1; i /= 2) {
// Upper warps write to shared memory.
...
float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE];
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
...
dst[row_idx] = accs[i];
}
// Lower warps update the output.
const float* src = &out_smem[warp_idx * HEAD_SIZE];
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
...
accs[i] += src[row_idx];
}
// Write out the accs.
}
```
## Output
Now we can write all of the calculated results from local register
memory to the final output in global memory.
```cpp
scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ head_idx * max_num_partitions * HEAD_SIZE
+ partition_idx * HEAD_SIZE;
```
First, we need to define the `out_ptr` variable, which points to
the start address of the assigned sequence and assigned head.
```cpp
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
from_float(*(out_ptr + row_idx), accs[i]);
}
}
```
Finally, we need to iterate over different assigned head positions
and write out the corresponding accumulated result based on the
`out_ptr`.
## Citation
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
---
# Plugin System
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_plugins_by_group][vllm.plugins.load_plugins_by_group] function in the `vllm.plugins` module.
## How vLLM Discovers Plugins
vLLM's plugin system uses the standard Python `entry_points` mechanism. This mechanism allows developers to register functions in their Python packages for use by other packages. An example of a plugin:
??? code
```python
# inside `setup.py` file
from setuptools import setup
setup(name='vllm_add_dummy_model',
version='0.1',
packages=['vllm_add_dummy_model'],
entry_points={
'vllm.general_plugins':
["register_dummy_model = vllm_add_dummy_model:register"]
})
# inside `vllm_add_dummy_model.py` file
def register():
from vllm import ModelRegistry
if "MyLlava" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model(
"MyLlava",
"vllm_add_dummy_model.my_llava:MyLlava",
)
```
For more information on adding entry points to your package, please check the [official documentation](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
Every plugin has three parts:
1. **Plugin group**: The name of the entry point group. vLLM uses the entry point group `vllm.general_plugins` to register general plugins. This is the key of `entry_points` in the `setup.py` file. Always use `vllm.general_plugins` for vLLM's general plugins.
2. **Plugin name**: The name of the plugin, i.e. the part before the `=` in each entry of the `entry_points` value list. In the example above, the plugin name is `register_dummy_model`. Plugins can be filtered by their names using the `VLLM_PLUGINS` environment variable. To load only a specific plugin, set `VLLM_PLUGINS` to that plugin's name (see the usage example after this list).
3. **Plugin value**: The fully qualified name of the function or module to register in the plugin system. In the example above, the plugin value is `vllm_add_dummy_model:register`, which refers to a function named `register` in the `vllm_add_dummy_model` module.
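For example, with the `vllm_add_dummy_model` package from the example above installed, you could restrict loading to just that plugin (a usage sketch; the checkpoint path is hypothetical):
```python
import os

# Only plugins whose names are listed here will be loaded;
# any other installed general plugins are skipped.
os.environ["VLLM_PLUGINS"] = "register_dummy_model"

from vllm import LLM

# Hypothetical checkpoint whose config lists the out-of-tree
# "MyLlava" architecture registered by the plugin.
llm = LLM(model="/path/to/my-llava-checkpoint")
```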
## Types of supported plugins
- **General plugins** (with group name `vllm.general_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree models into vLLM. This is done by calling `ModelRegistry.register_model` to register the model inside the plugin function.
- **Platform plugins** (with group name `vllm.platform_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree platforms into vLLM. The plugin function should return `None` when the platform is not supported in the current environment, or the platform class's fully qualified name when the platform is supported.
- **IO Processor plugins** (with group name `vllm.io_processor_plugins`): The primary use case for these plugins is to register custom pre-/post-processing of the model prompt and model output for pooling models. The plugin function returns the IOProcessor's class fully qualified name.
- **Stat logger plugins** (with group name `vllm.stat_logger_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree loggers into vLLM. The entry point should be a class that subclasses `StatLoggerBase`.
## Guidelines for Writing Plugins
- **Being re-entrant**: The function specified in the entry point should be re-entrant, meaning it can be called multiple times without causing issues. This is necessary because the function might be called multiple times in some processes.
### Platform plugins guidelines
1. Create a platform plugin project, for example, `vllm_add_dummy_platform`. The project structure should look like this:
```shell
vllm_add_dummy_platform/
├── vllm_add_dummy_platform/
│ ├── __init__.py
│ ├── my_dummy_platform.py
│ ├── my_dummy_worker.py
│ ├── my_dummy_attention.py
│ ├── my_dummy_device_communicator.py
│ ├── my_dummy_custom_ops.py
├── setup.py
```
2. In the `setup.py` file, add the following entry point:
```python
setup(
name="vllm_add_dummy_platform",
...
entry_points={
"vllm.platform_plugins": [
"my_dummy_platform = vllm_add_dummy_platform:register"
]
},
...
)
```
Please make sure `vllm_add_dummy_platform:register` is a callable function that returns the platform class's fully qualified name. For example:
```python
def register():
return "vllm_add_dummy_platform.my_dummy_platform.MyDummyPlatform"
```
3. Implement the platform class `MyDummyPlatform` in `my_dummy_platform.py`. The platform class should inherit from `vllm.platforms.interface.Platform`. Please follow the interface to implement the functions one by one. At a minimum, the following important functions and properties should be implemented:
- `_enum`: This property is the device enumeration from [PlatformEnum][vllm.platforms.interface.PlatformEnum]. Usually, it should be `PlatformEnum.OOT`, which means the platform is out-of-tree.
- `device_type`: This property should return the type of the device which pytorch uses. For example, `"cpu"`, `"cuda"`, etc.
- `device_name`: This property is usually set to the same value as `device_type`. It's mainly used for logging purposes.
- `check_and_update_config`: This function is called very early in the vLLM's initialization process. It's used for plugins to update the vllm configuration. For example, the block size, graph mode config, etc, can be updated in this function. The most important thing is that the **worker_cls** should be set in this function to let vLLM know which worker class to use for the worker process.
- `get_attn_backend_cls`: This function should return the attention backend class's fully qualified name.
- `get_device_communicator_cls`: This function should return the device communicator class's fully qualified name.
4. Implement the worker class `MyDummyWorker` in `my_dummy_worker.py`. The worker class should inherit from [WorkerBase][vllm.v1.worker.worker_base.WorkerBase]. Please follow the interface to implement the functions one by one. Basically, all interfaces in the base class should be implemented, since they are called throughout vLLM. To make sure a model can be executed, the basic functions that should be implemented are:
- `init_device`: This function is called to set up the device for the worker.
- `initialize_cache`: This function is called to set cache config for the worker.
- `load_model`: This function is called to load the model weights to device.
- `get_kv_cache_spec`: This function is called to generate the kv cache spec for the model.
- `determine_available_memory`: This function is called to profile the peak memory usage of the model and determine how much memory can be used for the KV cache without running out of memory.
- `initialize_from_config`: This function is called to allocate the device KV cache with the specified `kv_cache_config`.
- `execute_model`: This function is called every step to run inference with the model.
Additional functions that can be implemented are:
- If the plugin wants to support sleep mode feature, please implement the `sleep` and `wakeup` functions.
- If the plugin wants to support graph mode feature, please implement the `compile_or_warm_up_model` function.
- If the plugin wants to support speculative decoding feature, please implement the `take_draft_token_ids` function.
- If the plugin wants to support the LoRA feature, please implement the `add_lora`, `remove_lora`, `list_loras` and `pin_lora` functions.
- If the plugin wants to support the data parallelism feature, please implement the `execute_dummy_batch` function.
Please look at the worker base class [WorkerBase][vllm.v1.worker.worker_base.WorkerBase] for more functions that can be implemented.
5. Implement the attention backend class `MyDummyAttention` in `my_dummy_attention.py`. The attention backend class should inherit from [AttentionBackend][vllm.attention.backends.abstract.AttentionBackend]. It's used to calculate attention on your device. Take `vllm.v1.attention.backends` as an example; it contains many attention backend implementations.
6. Implement custom ops for high performance. Most ops can be run with the PyTorch native implementation, but the performance may not be good. In that case, you can implement specific custom ops for your plugin. Currently, vLLM supports the following kinds of custom ops:
- PyTorch ops. There are 3 kinds of PyTorch ops:
- `communicator ops`: Device communicator ops, such as all-reduce, all-gather, etc.
Please implement the device communicator class `MyDummyDeviceCommunicator` in `my_dummy_device_communicator.py`. The device communicator class should inherit from [DeviceCommunicatorBase][vllm.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase].
- `common ops`: Common ops, such as matmul, softmax, etc.
Please implement common ops by registering them out-of-tree (OOT). See the [CustomOp][vllm.model_executor.custom_op.CustomOp] class for more details.
- `csrc ops`: C++ ops. These ops are implemented in C++ and registered as torch custom ops.
Follow the csrc module and `vllm._custom_ops` to implement your ops.
- Triton ops. The custom-op registration approach does not work for Triton ops yet.
7. (optional) Implement other pluggable modules, such as LoRA, graph backend, quantization, mamba attention backend, etc.
## Compatibility Guarantee
vLLM guarantees the interface of documented plugins, such as `ModelRegistry.register_model`, will always be available for plugins to register models. However, it is the responsibility of plugin developers to ensure their plugins are compatible with the version of vLLM they are targeting. For example, `"vllm_add_dummy_model.my_llava:MyLlava"` should be compatible with the version of vLLM that the plugin targets.
The interface for the model/module may change during vLLM's development. If you see any deprecation log info, please upgrade your plugin to the latest version.
## Deprecation announcement
!!! warning "Deprecations"
- The `use_v1` parameter in `Platform.get_attn_backend_cls` is deprecated and has been removed in v0.13.0.
- `_Backend` in `vllm.attention` is deprecated and has been removed in v0.13.0. Please use `vllm.attention.backends.registry.register_backend` to add new attention backends to `AttentionBackendEnum` instead.
---
# Automatic Prefix Caching
Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic, etc.) and most open source LLM inference frameworks (e.g., SGLang).
While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:
```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```
In the example above, the KV cache in the first block can be uniquely identified with the token “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the block hash of `hash(tuple[components])`, where components are:
* Parent hash value: The hash value of the parent block (the block immediately before this one).
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.
!!! note "Note 1"
We only cache full blocks.
!!! note "Note 2"
The above hash key structure is not 100% collision free. Theoretically it's still possible for different prefix tokens to have the same hash value. To avoid hash collisions in a multi-tenant setup, **we use SHA256** as the hash function instead of Python's builtin hash.
SHA256 is supported since vLLM v0.8.3 and has been the default since v0.10.2. It comes with a negligible performance impact of about 75ns per token (<4ms for 50k tokens of context).
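Conceptually, the chained block hashing described above can be sketched in a few lines of Python (an illustration only, not vLLM's actual implementation; the helper name is made up):
```python
import hashlib
import pickle

def hash_block(parent_hash: bytes | None,
               block_tokens: tuple[int, ...],
               extra: tuple | None = None) -> bytes:
    # The hash covers the parent hash (and hence the whole prefix), the exact
    # tokens in this block, and any extra keys (LoRA ID, mm hashes, cache salt).
    payload = pickle.dumps((parent_hash, block_tokens, extra))
    return hashlib.sha256(payload).digest()

# Chain the hashes block by block, as in the three-block example above.
blocks = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
parent = None
for tokens in blocks:
    parent = hash_block(parent, tokens)
    print(parent.hex()[:16])
```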
**A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:
```text
messages = [
{"role": "user",
"content": [
{"type": "text",
"text": "What's in this image?"
},
{"type": "image_url",
"image_url": {"url": image_url},
},
]},
]
```
It will become the following prompt:
```text
Prompt:
[INST]What's in this image?\n[IMG][/INST]
Tokenized prompt:
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, 10, 4]
Prompt with placeholders (<placeholder>):
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <placeholder>, <placeholder>, ..., <placeholder>, 4]
```
As we can see, after the tokenization, the `[IMG]` will be replaced by a sequence of placeholder tokens, and these placeholders will be replaced by image embeddings during prefill. The challenge for prefix caching to support this case is we need to differentiate images from the placeholders. To address this problem, we encode the image hash generated by the frontend image processor. For example, the hash of the blocks in the above prompt would be (assuming block size 16, and we have 41 placeholder tokens):
```text
Block 0
Parent hash: None
Token IDs: 1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <placeholder>, ..., <placeholder>
Extra hash: <image hash>
```
In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of major KV cache operators (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end to end prefix caching workflow.
**Cache Isolation for Security**
To improve privacy in shared environments, vLLM supports isolating prefix cache reuse through optional per-request salting. By including a `cache_salt` in the request, this value is injected into the hash of the first block, ensuring that only requests with the same salt can reuse cached KV blocks. This prevents timing-based attacks where an adversary could infer cached content by observing latency differences. This offers protection without compromising performance.
```json
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Here is a document with details about the world series: ..."},
{"role": "user", "content": "Who won the world series in 2020?"}
],
"cache_salt": "your-cache-salt"
}
```
With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
## Data Structure
The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
```python
class KVCacheBlock:
# The block ID (immutable)
block_id: int
# The block hash (will be assigned when the block is full,
# and will be reset when the block is evicted).
block_hash: BlockHash
# The number of requests using this block now.
ref_cnt: int
# The pointers to form a doubly linked list for the free queue.
prev_free_block: "KVCacheBlock | None" = None
next_free_block: "KVCacheBlock | None" = None
```
There are two design points to highlight:
1. We allocate all KVCacheBlocks when initializing the KV cache manager to form a block pool. This avoids Python object creation overheads and makes it easy to track all blocks at all times.
2. We introduce doubly linked list pointers directly in the KVCacheBlock, so that we can construct a free queue directly. This gives us two benefits:
1. We have O(1) complexity when moving elements from the middle to the tail.
2. We avoid introducing another Python queue (e.g., `deque`), which wraps the elements.
As a result, we will have the following components when the KV cache manager is initialized:

* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only store the pointers of head and tail blocks for manipulations.
* Cache blocks: Mapping from hash key to block IDs.
* Request blocks: Mapping from request ID to allocated block IDs.
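To make the free-queue design concrete, here is a minimal sketch of an intrusive doubly linked free queue (an illustration only, not vLLM's actual implementation):
```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    prev: "Block | None" = None
    next: "Block | None" = None

class FreeQueue:
    """Doubly linked free queue with O(1) popleft, append, and remove."""

    def __init__(self, blocks: list) -> None:
        self.head = None
        self.tail = None
        for block in blocks:
            self.append(block)

    def append(self, block: Block) -> None:
        block.prev, block.next = self.tail, None
        if self.tail is not None:
            self.tail.next = block
        else:
            self.head = block
        self.tail = block

    def remove(self, block: Block) -> None:
        # O(1): no scan is needed because the block carries its own pointers.
        if block.prev is not None:
            block.prev.next = block.next
        else:
            self.head = block.next
        if block.next is not None:
            block.next.prev = block.prev
        else:
            self.tail = block.prev
        block.prev = block.next = None

    def popleft(self) -> Block:
        assert self.head is not None, "free queue is empty"
        block = self.head
        self.remove(block)
        return block

# Usage: pull a cache-hit block out of the middle in O(1) when it is "touched",
# or evict the LRU block from the head.
blocks = [Block(i) for i in range(4)]
queue = FreeQueue(blocks)
queue.remove(blocks[2])   # touch a cached block in the middle
lru = queue.popleft()     # evict the least recently used block
print(lru.block_id)       # 0
```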
## Operations
### Block Allocation
**New request:** Workflow for the scheduler to schedule a new request with KV cache block allocation:
1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up cache blocks.
2. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
2. “Touch” the computed blocks. It increases the reference count of the computed block by one, and removes the block from the free queue if the block wasn’t used by other requests. This is to avoid these computed blocks being evicted. See the example in the next section for illustration.
3. Allocate new blocks by popping the heads of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it anymore from now on.
4. If an allocated block is already full of tokens, we immediately add it to the cache block, so that the block can be reused by other requests in the same batch.
**Running request:** Workflow for the scheduler to schedule a running request with KV cache block allocation:
1. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
2. Allocate new blocks by popping the heads of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it anymore from now on.
3. Append token IDs to the slots in existing blocks as well as the new blocks. If a block is full, we add it to the cache block to cache it.
**Duplicated blocks**
Assuming the block size is 4 and you send a request (Request 1) with prompt ABCDEF and decoding length 3:
```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 1 (EFG)]
Cache Blocks: 0
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 1 (EFGH)]
Cache Blocks: 0, 1
Time 2:
Tokens: [A, B, C, D, E, F, G, H, I]
Block Table: [0 (ABCD), 1 (EFGH), 2 (I)]
Cache Blocks: 0, 1
```
Now block 0 and block 1 are cached, and we send the same request again (Request 2) with greedy sampling, so that it will produce exactly the same outputs as Request 1:
```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 3 (EFG)]
Cache Blocks: 0, 1
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 3 (EFGH)]
Cache Blocks: 0, 1, 3
```
As can be seen, block 3 is a new full block and is cached. However, it is redundant with block 1, meaning that we cached the same block twice. In v0, when detecting that block 3 is duplicated, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` at Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed.
### Free
When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and blocks 2, 3, 4, and 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.

### Eviction (LRU)
When the head block (least recently used block) of the free queue is cached, we have to evict the block to prevent it from being used by other requests. Specifically, eviction involves the following steps:
1. Pop the block from the head of the free queue. This is the LRU block to be evicted.
2. Remove the block ID from the cache block.
3. Remove the block hash.
## Example
In this example, we assume the block size is 4 (each block can cache 4 tokens), and we have 10 blocks in the KV-cache manager in total.
**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.

**Time 2: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

**Time 3: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.

**Time 4: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.

**Time 5: Request 1 is finished and free.**

**Time 6: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

---
# `torch.compile` integration
In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.
Throughout the example, we will run a common Llama model, and turn on debug level logging to show all the details. The command to be used is `VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.2-1B`.
!!! note
For more information and the latest progress of `torch.compile` integration, see this [Blog Post](https://blog.vllm.ai/2025/08/20/torch-compile.html).
## Compilation Cache
In the very verbose logs, we can see:
```console
INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile
```
vLLM takes all the available factors into consideration and decides on a directory to store all the compilation artifacts. This means you can directly copy the whole `~/.cache/vllm/torch_compile_cache` directory in your deployment scenario to save a great amount of compilation time, and hence accelerate the startup time of the vLLM instance.
The factors considered include:
- All the related configs (see the `compute_hash` functions in their respective configs in the [config folder](../../vllm/config))
- PyTorch configs (see the `compute_hash` functions in the [compiler_interface.py](../../vllm/compilation/compiler_interface.py))
- The model's forward function and the relevant functions called by the forward function (see below)
With all these factors taken into consideration, usually we can guarantee that the cache is safe to use, and will not cause any unexpected behavior. Therefore, the cache is enabled by default. If you want to debug the compilation process, or if you suspect the cache is causing some issues, you can disable it by setting the environment variable `VLLM_DISABLE_COMPILE_CACHE=1`.
A unique aspect of vLLM's `torch.compile` integration is that we guarantee all compilation finishes before we serve any requests. No requests will trigger new compilations. Otherwise, the engine would be blocked on that request, and the response time would have unexpected spikes.
By default, the cache saves compiled artifacts as binary files. If you would like to interact with the generated code for debugging purposes, set the field `compile_cache_save_format=unpacked` in the compilation config, or omit this and set the env variable `VLLM_COMPILE_CACHE_SAVE_FORMAT=unpacked`.
## Dynamic shapes and vllm guard dropping
`torch.compile` is designed to guard on dynamic shapes without hesitation
when needed. This conflicts with vLLM's `torch.compile` approach of
dropping the guards, since many of those guards could be material.
`torch.compile` provides two kinds of dynamic shapes: `backed` and `unbacked`.
`torch.compile` guards on `backed` dynamic shapes and does not provide a
guarantee that no guards will be added to them. User code, dynamo,
inductor, and autograd all can add guards. Moreover, for 0/1
specializations, backed symbols are specialized unconditionally to 0, 1,
or >=2 even without encountering a branching on those ranges.
On the contrary, `unbacked` dynamic shapes are guaranteed not to be guarded
on and are not 0/1 specialized. However, there is a possibility of
throwing a data dependent error (DDE) when a branch that requires their
value is encountered and no explicit unbacked handling is defined. The
framework is converging to a state where it won't throw a DDE but rather
pick general paths. One downside of using unbacked is missed optimization
opportunities, due either to perf bugs, to picking general paths, or to
using a fixed hint that is not based on the example input (this will be
fixed soon with the override_hint API). An example of picking a general
path is assuming the input is not contiguous in functions that call
contiguous() or reshape() when contiguity cannot be proven symbolically,
at the risk of introducing a clone.
`backed_size_oblivious` is a flag that enables treating backed symbols as
unbacked wherever explicit handling for unbacked is defined. With this
mode, 0/1 specializations are mostly avoided in framework code and the
default 0/1 specialization does not happen. However, there is still no
guarantee that torch.compile won't guard, especially due to user code or
custom passes. `backed_size_oblivious` is experimental in PyTorch compile
and could be deprecated. That said, it's a safer option to use than
`backed`, and the probability of reducing performance is lower than with
`unbacked`.
### Configuring Dynamic Shapes
The `DynamicShapesConfig` allows you to control the dynamic shapes behavior by
setting the `type` field. You can choose between three modes:
`BACKED`(default), `UNBACKED` , and `BACKED_SIZE_OBLIVIOUS`.
#### Offline Inference Example (Using LLM class)
When using the `LLM` class for offline inference, you can configure dynamic
shapes through the `compilation_config` parameter:
```python
from vllm import LLM, SamplingParams
from vllm.config.compilation import CompilationConfig, DynamicShapesConfig, DynamicShapesType
# Example: Using backed_size_oblivious (experimental, safer than backed)
llm = LLM(
model="meta-llama/Llama-3.2-1B",
compilation_config=CompilationConfig(
dynamic_shapes_config=DynamicShapesConfig(
type=DynamicShapesType.BACKED_SIZE_OBLIVIOUS
)
)
)
# Example: Using unbacked (strongest guarantee against guards)
llm = LLM(
model="meta-llama/Llama-3.2-1B",
compilation_config=CompilationConfig(
dynamic_shapes_config=DynamicShapesConfig(
type=DynamicShapesType.UNBACKED
)
)
)
# Generate outputs
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
```
#### Online Serving Example (Using vllm serve)
When using `vllm serve` for online serving, you can configure dynamic shapes
through the `--compilation-config` flag:
```bash
# Example: Using unbacked
vllm serve meta-llama/Llama-3.2-1B \
--compilation-config '{"dynamic_shapes_config": {"type": "unbacked"}}'
# Alternative: Using dot notation (simpler for single values)
vllm serve meta-llama/Llama-3.2-1B -cc.dynamic_shapes_config.type=unbacked
```
#### Choosing the Right Mode
- **BACKED** (default): Use when you're willing to accept potentially unsafe dropping of guards
for maximal performance. Guards could be unsoundly added and then ignored.
- **UNBACKED**: Use when you need the strongest guarantee against guards.
This is the most conservative option but may miss some optimization opportunities.
- **BACKED_SIZE_OBLIVIOUS**: Use when you want a balance between avoiding guards
and performance. This experimental mode is safer than BACKED but still not as
conservative as UNBACKED.
## Python Code Compilation
In the very verbose logs, we can see:
??? console "Logs"
```text
DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function
DEBUG 03-07 03:06:54 [backends.py:370] Traced files (to be considered for compilation cache):
DEBUG 03-07 03:06:54 [backends.py:370] xxx/torch/_dynamo/polyfills/builtins.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/torch/nn/modules/container.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/torch/nn/modules/module.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/attention/layer.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/distributed/communication_op.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/distributed/parallel_state.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/custom_op.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/layers/activation.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/layers/layernorm.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/layers/linear.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/layers/rotary_embedding.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/layers/vocab_parallel_embedding.py
DEBUG 03-07 03:06:54 [backends.py:370] xxx/vllm/model_executor/models/llama.py
DEBUG 03-07 03:07:07 [backends.py:462] Computation graph saved to ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/computation_graph.py
DEBUG 03-07 03:07:07 [wrapper.py:105] Dynamo transformed code saved to ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/transformed_code.py
```
This is about the Python code compilation, i.e. graph capture by Dynamo. It tries to trace the function with code `xxx/vllm/model_executor/models/llama.py:339`, which is the `forward` function of the model we compile. During the forward pass, there are also other functions called and inlined by Dynamo, as shown by the logs, including some PyTorch functions from `xxx/torch/nn/modules/module.py` (used by PyTorch `nn.Module`, because module attribute access will trigger a function call), some communication / attention / activation functions from vLLM. All the traced files will be considered when we decide the cache directory to use. This way, any code change in the above files will trigger compilation cache miss, and therefore recompilation.
The result of the Dynamo compilation is a new function stored in `~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/transformed_code.py`. Usually, this function unpacks tensors from the module and then passes them to the traced computation graph. The computation graph is stored in `~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/computation_graph.py`.
## Computation Graph Processing
The computation graph has shape annotations for every tensor. The inputs are input ids, position ids, weights and buffers from the model, and the outputs are the final hidden states. Note that lm head projection and sampling operations are not considered in the graph.
Most of the inputs to the computation graph have static shapes, since they are model weights and buffers, and will not change during the lifetime of the model. Only the input ids and position ids have symbolic shapes, i.e. the shape can change from batch to batch. However, they share the same symbolic shape. That is to say, the only changing size in the computation graph is the batch size (the number of tokens processed in the current forward pass).
The attention operation is complicated, and it needs to interact with KV caches, which have complicated shapes. Fortunately, the output of the attention operation just shares the same shape as its input query. Therefore, we wrap the whole attention operation into a PyTorch custom op `torch.ops.vllm.unified_attention_with_output`, so that Dynamo will not try to inspect any of its internal operations. This way, although the attention operation is complicated, we can still capture the model's computation graph as a full graph from Dynamo's perspective.
The computation graph is further split into pieces, by the `splitting_ops` (usually this is the attention operation). Therefore, in the `~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/computation_graph.py` file, we can see lots of submodules, each submodule is a piece of graph after splitting:
- Attention operation itself is a submodule.
- The part of computation graph, from one attention operation to the next attention operation, is a submodule.
Every submodule can be identified by its index, and will be processed individually.
## Computation Graph Compilation
In the very verbose logs, we can also see:
```console
DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py')
...
DEBUG 03-07 03:52:45 [backends.py:134] store the 15-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py')
DEBUG 03-07 03:52:45 [backends.py:134] store the 16-th graph for shape None from inductor via handle ('fvj3ccoi7m34f3dnr4itmu55mmun44l5xymwhrjlwisylsk7q6jy', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/tf/ctfftkglj7b4lcttq5cymx6cew372uoauupqn6ldsvpiucavqcjc.py')
```
This means the first piece of computation graph (with shape `None` for symbolic shape) is compiled by Inductor (with a key `fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw`). The compiled kernel is stored in `~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py`. You can open the file to see what is the code Inductor finally runs.
One more detail: you can see that the 1-th graph and the 15-th graph have the same key, while the 0-th graph and the 16-th graph are different. This is expected: since we split the graph by the attention op, we get 3 unique subgraphs:
- the first layer before attention
- every middle layer, from one attention operation to the next attention operation
- the final layer after attention
If we already have the cache directory (e.g. run the same code for the second time), we will see the following logs:
```console
DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
```
This time, Inductor compilation is completely bypassed, and we will load from disk to read the compilation artifact we get from the last time.
The above example just uses Inductor to compile for a general shape (i.e. symbolic shape). We can also use Inductor to compile for some of the specific shapes, for example:
```bash
vllm serve meta-llama/Llama-3.2-1B \
--compilation-config '{"compile_sizes": [1, 2, 4, 8]}'
```
Then it will also compile specific kernels just for batch sizes `1, 2, 4, 8`. At this point, all of the shapes in the computation graph are static and known, and we will turn on auto-tuning to tune for maximum performance. This can be slow the first time you run it, but the next time you run it, we can directly bypass the tuning and run the tuned kernel.
When all the shapes are known, `torch.compile` can compare different configs, and often find some better configs to run the kernel. For example, we can see the following log:
??? console "Logs"
```
AUTOTUNE mm(8x2048, 2048x3072)
triton_mm_4 0.0130 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
triton_mm_8 0.0134 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_12 0.0148 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
mm 0.0160 ms 81.6%
triton_mm_16 0.0165 ms 78.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
triton_mm_3 0.0199 ms 65.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
triton_mm_1 0.0203 ms 64.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=2
triton_mm_7 0.0203 ms 64.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_2 0.0208 ms 62.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_11 0.0215 ms 60.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 2.0428 seconds and 7.5727 seconds precompiling
```
It means that for a matrix multiplication of an `8x2048` matrix with a `2048x3072` matrix, `torch.compile` tries Triton templates with various configs, and it is much faster than the default code (which dispatches to the cuBLAS library).
Unfortunately, because auto-tuning takes quite a long time (from seconds to minutes, depending on the model size and the batch size), even though it can be cached for later use, for the sake of user-friendliness, we turn it off by default. If you want to have max performance, it is recommended to try it, by compiling specific shapes.
## Cudagraph Capture
vLLM's V1 architecture uses piecewise cudagraphs that align with the piecewise compilation. The full computation graph is split as mentioned above, and we only capture cudagraphs for the pieces of the graph between attention operations (including the first piece before any attention operation and the last piece after all the attention operations). This is based on a common observation: computation between attentions is usually token-wise and easy to handle with cudagraph, while the attention operation itself is non-trivial to make cudagraph-compatible. Thus, by running the attention operation in eager mode and the rest of the operations in cudagraph, we keep the flexibility of the attention operation.
The piecewise cudagraph also has fine-grained memory management. The purpose is to exclude only the attention kernel from the cudagraph, while keeping all the remaining modules and the memory allocation operations inside the cudagraph. This is why the attention operation in V1 takes the output tensor as an input of the attention.
The cudagraphs are captured and managed by the compiler backend, and replayed when the batch size has corresponding cudagraph captured. The caller of the model (model runner) only needs to make sure it manages the input buffers correctly. All of the intermediate buffers are managed automatically by the compiler backend.
By default, vLLM will try to determine a set of sizes to capture cudagraph. You can also override it using the config `cudagraph_capture_sizes`:
```bash
vllm serve meta-llama/Llama-3.2-1B \
--compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
```
Then it will only capture cudagraphs for the specified sizes. This can be useful for fine-grained control over cudagraph capture.
### Full Cudagraph capture
It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models or MOEs. See [CUDA Graphs](cuda_graphs.md) for more details.
---
# Automatic Prefix Caching
## Introduction
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
!!! note
Technical details on how vLLM implements APC can be found [here](../design/prefix_caching.md).
## Enabling APC in vLLM
Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example:
[examples/offline_inference/automatic_prefix_caching.py](../../examples/offline_inference/automatic_prefix_caching.py)
## Example workloads
We describe two example workloads, where APC can provide huge performance benefit:
- Long document query, where the user repeatedly queries the same long document (e.g. software manual or annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document *only once*, and all future requests can avoid recomputing this long document by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
- Multi-round conversation, where the user may chat with the application multiple times in the same chatting session. In this case, instead of processing the whole chatting history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
## Limits
APC in general does not reduce the performance of vLLM. With that being said, APC only reduces the time of processing the queries (the prefilling phase) and does not reduce the time of generating new tokens (the decoding phase). So APC does not bring performance gain when vLLM spends most of the time generating answers to the queries (e.g. when the length of the answer is long), or new queries do not share the same prefix with any of existing queries (so that the computation cannot be reused).
---
# Batch Invariance
!!! note
Batch invariance is currently in beta. Some features are still under active development.
Track progress and planned improvements at the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).
This document shows how to enable batch invariance in vLLM. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch.
## Motivation
Batch invariance is crucial for several use cases:
- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.
## Hardware Requirements
Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:
- **H-series**: H100, H200
- **B-series**: B100, B200
## Enabling Batch Invariance
Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:
```bash
export VLLM_BATCH_INVARIANT=1
```
### Online Inference (Server Mode)
To start a vLLM server with batch invariance enabled:
```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```
Then use the OpenAI-compatible client:
```python
from openai import OpenAI
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
prompt="The future of AI is",
max_tokens=100,
temperature=0.7,
seed=42,
)
print(response.choices[0].text)
```
### Offline Inference
For offline batch inference with batch invariance:
```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
"Machine learning enables",
"Deep learning models can",
]
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=100,
seed=42,
)
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
)
# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Generated: {generated_text!r}\n")
```
## Tested Models
Batch invariance has been tested and verified on the following models:
- **DeepSeek series**: `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`
Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).
## Implementation Details
When batch invariance is enabled, vLLM:
1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism (such as custom all-reduce operations in tensor parallel mode)
!!! note
Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
## Future Improvements
The batch invariance feature is under active development. Planned improvements include:
- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation
For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).
---
# Custom Arguments
You can use vLLM *custom arguments* to pass in arguments which are not part of the vLLM `SamplingParams` and REST API specifications. Adding or removing a vLLM custom argument does not require recompiling vLLM, since the custom arguments are passed in as a dictionary.
Custom arguments can be useful if, for example, you want to use a [custom logits processor](./custom_logitsprocs.md) without modifying the vLLM source code.
!!! note
Make sure your custom logits processor has implemented `validate_params` for custom arguments. Otherwise, invalid custom arguments can cause unexpected behaviour.
## Offline Custom Arguments
Custom arguments passed to `SamplingParams.extra_args` as a `dict` will be visible to any code which has access to `SamplingParams`:
``` python
SamplingParams(extra_args={"your_custom_arg_name": 67})
```
This allows arguments which are not already part of `SamplingParams` to be passed into `LLM` as part of a request.
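For example, the dictionary can be attached to an individual request when calling `llm.generate`. The key name `your_custom_arg_name` below is just a placeholder; vLLM passes the dictionary through untouched, and only code that explicitly reads `SamplingParams.extra_args` (such as a custom logits processor) will act on it:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# extra_args is an arbitrary dict that rides along with the request.
params = SamplingParams(max_tokens=32, extra_args={"your_custom_arg_name": 67})

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```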
## Online Custom Arguments
The vLLM REST API allows custom arguments to be passed to the vLLM server via `vllm_xargs`. The example below integrates custom arguments into a vLLM REST API request:
``` bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
...
"vllm_xargs": {"your_custom_arg": 67}
}'
```
Furthermore, OpenAI SDK users can access `vllm_xargs` via the `extra_body` argument:
``` python
batch = await client.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
...,
extra_body={
"vllm_xargs": {
"your_custom_arg": 67
}
}
)
```
!!! note
`vllm_xargs` is assigned to `SamplingParams.extra_args` under the hood, so code which uses `SamplingParams.extra_args` is compatible with both offline and online scenarios.
---
# Custom Logits Processors
!!! important
Some logits processor design changes are still in progress and the API may change in the near future. We hope to stabilize this part of the API soon.
A "custom" logits processor is written by a user of vLLM and is loaded into vLLM at initialization without needing to modify or recompile the vLLM source code. It is the opposite of a built-in logits processor.
This document shows how to write, load and use a custom logits processor.
## Logits Processors Background
A logits processor adjusts the next-token probability distribution, usually with the intention of steering the model towards a desired type of behavior.
In vLLM, logits processors operate at batch granularity. During a given engine step, the logits processor consumes a `(num_requests) x (vocab_size)` tensor of raw logits output by the model. For all requests which enable the logits processor, the logits processor applies a transformation to the corresponding row of the logits tensor, while leaving other rows unmodified. The transformed logits tensor is then passed to softmax.
## Creating a Custom Logits Processor
Custom logits processors must subclass `vllm.v1.sample.logits_processor.LogitsProcessor` and define (at minimum) the following methods:
* `validate_params(cls, sampling_params: SamplingParams)`:
* Raise `ValueError` if `SamplingParams` contains invalid arguments (especially custom arguments) used by the logits processor.
* When a request is sent to an entrypoint, `validate_params()` validates its `SamplingParams` and rejects the request if the arguments are invalid.
* **Note:** it's important to implement `validate_params()` to prevent invalid parameters from reaching your custom logits processor. Otherwise, requests with invalid parameters can cause unexpected behaviour.
* `__init__(self, vllm_config: VllmConfig, device: torch.device, is_pin_memory: bool)`
* `vllm_config`: engine configuration data structure
* `device`: hardware accelerator device info
* `is_pin_memory`: flag indicating whether pin memory is available to support logits processor implementation
* `apply(self, logits: torch.Tensor) -> torch.Tensor`:
* Consume a `(num_requests) x (vocab_size)` logits tensor (`logits`)
* Apply logits processor transformation at batch granularity
* Return a transformed `(num_requests) x (vocab_size)` logits tensor
* You can modify the input logits tensor in-place or out-of-place; in-place is more memory-efficient
* `is_argmax_invariant(self) -> bool`:
* Return `True` if the logits processor is argmax invariant (never changes what is the highest-logit-value token ID for a given request), `False` if the logits processor may modify argmax
* `is_argmax_invariant()` is evaluated once at startup; if `True`, vLLM will skip applying this logits processor in a given step when all requests use greedy sampling
* `update_state(self, batch_update: Optional["BatchUpdate"]) -> None`:
* Consume a `BatchUpdate` data structure representing persistent batch state changes at the beginning of the current engine step
* Use the `BatchUpdate` members to update logits processor internal state
* **Note:** batch update data structure may be `None`, signaling no change to the batch constituents. In this case, the LogitsProcessor might still want to update its state based on the updated `output_token_ids` lists that it could have retained when they were added.
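Putting the interface together, a do-nothing skeleton might look like the following. This is a sketch of the required method signatures only (the class name `NoOpLogitsProcessor` is made up); see the `DummyLogitsProcessor` example later in this document for a complete implementation:
```python
import torch

from vllm.config import VllmConfig
from vllm.sampling_params import SamplingParams
from vllm.v1.sample.logits_processor import BatchUpdate, LogitsProcessor

class NoOpLogitsProcessor(LogitsProcessor):
    @classmethod
    def validate_params(cls, params: SamplingParams):
        # Raise ValueError here if params / extra_args are invalid.
        pass

    def __init__(self, vllm_config: VllmConfig, device: torch.device,
                 is_pin_memory: bool):
        # Initialize per-request state containers here.
        pass

    def is_argmax_invariant(self) -> bool:
        # A no-op transformation never changes the argmax token.
        return True

    def update_state(self, batch_update: BatchUpdate | None) -> None:
        # React to Add/Remove/Move operations on the persistent batch.
        pass

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # (num_requests) x (vocab_size) in, same shape out.
        return logits
```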
### How the vLLM engine builds the `BatchUpdate` data structure
!!! important
Some logits processors design changes are still in progress. We expect
that in the future you will not need to account for batch state changes
when implementing a logits processor, and the information in this section
will become irrelevant.
Logits processor `update_state()` implementations should assume the following model for how the model runner updates persistent batch state (expressed here in terms of the `BatchUpdate` abstraction):
1. Identify indices of requests which finished in the current engine step
2. Identify new requests introduced in the current step
3. Use Add operations to replace as many finished requests as possible with new requests, in order of increasing index of the replaced request, starting with the lowest index
4. Based on the relative number of new and finished requests:
1. If the numbers of new and finished requests are the same, proceed to next step
2. *If there are more new requests than finished requests:* apply Add operations to extend the batch with the remaining new requests which did not replace finished requests. Assign consecutive indices to these new requests, starting with `current_max_batch_index + 1`
3. *If there are fewer new requests than finished requests:*
* Apply Remove operations to finished requests which were not replaced with new requests. These removed request indices will necessarily be greater than the greatest index of the finished requests which were replaced in the previous step. The Removes may leave the batch in a non-contiguous state
* **"Condense" the batch to be contiguous:** starting with the lowest-index empty slot (which was caused by a Remove), apply a Unidirectional Move from the current highest non-empty slot in the batch to fill the empty slot. Proceed with additional Unidirectional Move operations in order of increasing empty slot destination index and decreasing non-empty slot source index until the batch is contiguous
* **Shrink the batch:** a side effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
5. Reorder the batch for improved efficiency. Depending on the attention backend implementation and the current characteristics of the batch, zero or more Swap Move operations may be applied to reorder the batch
Notes:
* A logits processor `update_state()` method must process batch update operations in the following order: removes, adds, moves
* The index argument for Add operations refers to the index *at the time the Add occurred*, i.e. before any Move operations
* Example: if a request is Added at index 5 and then swapped with index 3, the Add operation in `BatchUpdate.added` will be associated with index 5 not 3
* In other words Move operations can be assumed to be applied after Adds and Removes
* Move operations can be assumed to be applied in the order in which they appear in `BatchUpdate.moved`
* If there are no new/finished requests and there is no batch reordering, then the batch update for the logits processors will be `None`
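To make the model above concrete, consider a persistent batch of four requests (indices 0-3) in which the requests at indices 1 and 2 finish and a single new request arrives. The values below are hand-constructed for illustration (they are not produced by vLLM) and follow the tuple layouts iterated over by the `DummyLogitsProcessor` example later in this document:
```python
# Step 3: the new request replaces the lowest-index finished request, so the
# Add list holds one entry for index 1 (index, SamplingParams, then the
# per-request token id lists):
added = [(1, "<SamplingParams>", "<prompt_token_ids>", "<output_token_ids>")]

# Step 4.3: the finished request at index 2 was not replaced, so it is removed:
removed = [2]

# Condense: the lowest empty slot is 2 and the highest non-empty slot is 3,
# so a single Unidirectional Move (from index 3 to index 2) fills the gap:
moved = [(3, 2, "MoveDirectionality.UNIDIRECTIONAL")]

# Shrink: after condensing, three slots remain occupied, so
# BatchUpdate.batch_size is updated to 3.
batch_size = 3
```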
### Passing Custom Argument to a Custom Logits Processor
Unlike built-in logits processors, custom logits processors may require configuration arguments that are not hard-coded into `SamplingParams` or the vLLM server REST API. To solve this problem, custom logits processors may leverage vLLM [custom arguments](./custom_arguments.md) support to receive configuration settings from the user (although you are also free to design a custom logits processor which utilizes the pre-existing fields in `SamplingParams`.)
### Example Custom Logits Processor Implementation
The contrived example below implements a custom logits processor which consumes a `(num_requests) x (vocab_size)` logits tensor and masks out all tokens except for one (`target_token`) with `float(-inf)`. The logits processor is disabled for any request that does not specify `target_token`. To determine whether the logits processor is enabled and which token to leave unmasked, the logits processor checks `SamplingParams.extra_args` for a `target_token` custom argument associated with each request:
??? code "Example custom logits processor definition"
``` python
import torch

from vllm.config import VllmConfig
from vllm.sampling_params import SamplingParams
from vllm.v1.sample.logits_processor import (BatchUpdate,
                                              LogitsProcessor,
                                              MoveDirectionality)

class DummyLogitsProcessor(LogitsProcessor):
    """Fake logit processor to support unit testing and examples"""

    @classmethod
    def validate_params(cls, params: SamplingParams):
        target_token: int | None = params.extra_args and params.extra_args.get(
            "target_token"
        )
        if target_token is not None and not isinstance(target_token, int):
            raise ValueError(f"target_token value {target_token} is not int")

    def __init__(self, vllm_config: "VllmConfig", device: torch.device,
                 is_pin_memory: bool):
        self.req_info: dict[int, int] = {}

    def is_argmax_invariant(self) -> bool:
        """Never impacts greedy sampling"""
        return False

    def update_state(self, batch_update: BatchUpdate | None):
        if not batch_update:
            return

        # Process added requests.
        for index, params, _, _ in batch_update.added:
            assert params is not None
            self.validate_params(params)
            if params.extra_args and (target_token :=
                                      params.extra_args.get("target_token")):
                self.req_info[index] = target_token
            else:
                self.req_info.pop(index, None)

        if self.req_info:
            # Process removed requests.
            for index in batch_update.removed:
                self.req_info.pop(index, None)

            # Process moved requests, unidirectional move (a->b) and swap
            # (a<->b)
            for adx, bdx, direct in batch_update.moved:
                a_val = self.req_info.pop(adx, None)
                b_val = self.req_info.pop(bdx, None)
                if a_val is not None:
                    self.req_info[bdx] = a_val
                if direct == MoveDirectionality.SWAP and b_val is not None:
                    self.req_info[adx] = b_val

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if not self.req_info:
            return logits

        # Save target values before modification
        cols = torch.tensor(
            list(self.req_info.values()), dtype=torch.long, device=logits.device
        )
        rows = torch.tensor(
            list(self.req_info.keys()), dtype=torch.long, device=logits.device
        )
        values_to_keep = logits[rows, cols].clone()

        # Mask all but target tokens
        logits[rows] = float('-inf')
        logits[rows, cols] = values_to_keep

        return logits
```
In the rest of this document, we will use `DummyLogitsProcessor` as an example of a custom logits processor.
The `DummyLogitsProcessor.update_state()` implementation maintains a "sparse" representation of the batched requests in the `self.req_info` dictionary: only those requests which specify a `target_token` value have a key in the dictionary. `update_state()` adjusts the stored request indices and `target_token` values (keys and values respectively in `self.req_info`) in response to Add, Remove and Move operations against the persistent batch.
### Wrapping an Existing Request-Level Logits Processor
Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here][vllm.logits_process]) conforming to the following type annotation:
``` python
RequestLogitsProcessor = Union[
    # (output token ids, logits tensor) -> logits tensor
    Callable[[list[int], Tensor], Tensor],
    # (prompt token ids, output token ids, logits tensor) -> logits tensor
    Callable[[list[int], list[int], Tensor], Tensor],
]
```
While request-level logits processors are explicitly *not* supported in the vLLM engine, vLLM *does* provide a convenient process to wrap an existing `Callable` request-level logits processor and create a batch-level logits processor that is compatible with vLLM. The `Callable` must conform to the type annotation above; if your request-level logits processor has a different interface, then in order to wrap it, you may need to modify it or implement an additional wrapper layer to comply with the interface specification above.
You can wrap the request-level logits processor by subclassing `AdapterLogitsProcessor` as shown in the example below (in this example, `DummyPerReqLogitsProcessor` is a stand-in for your request-level logits processor which needs to be wrapped.):
* Override `AdapterLogitsProcessor.validate_params(cls,params)` to validate request's sampling parameters.
* Override `AdapterLogitsProcessor.is_argmax_invariant(self)` to accurately reflect whether your request-level logits processor may impact which token has the highest-value logit.
* Override `AdapterLogitsProcessor.new_req_logits_processor(self,params)` to create a new request-level logits processor instance from a `SamplingParams` instance:
??? code "Example of Wrapping a Request-Level Logits Processor"
``` python
...

from vllm.v1.sample.logits_processor import (
    AdapterLogitsProcessor,  # Wrapper base-class
    RequestLogitsProcessor,  # Request-level logitsproc type annotation
)

...

# Stand-in for your request-level logits processor:
class DummyPerReqLogitsProcessor:
    """The request-level logits processor masks out all logits except the
    token id identified by `target_token`"""

    def __init__(self, target_token: int) -> None:
        """Specify `target_token`"""
        self.target_token = target_token

    def __call__(
        self,
        output_ids: list[int],
        logits: torch.Tensor,
    ) -> torch.Tensor:
        val_to_keep = logits[self.target_token].item()
        logits[:] = float("-inf")
        logits[self.target_token] = val_to_keep
        return logits

...

# Example of wrapping the request-level logits processor:
class WrappedPerReqLogitsProcessor(AdapterLogitsProcessor):
    """Example of wrapping a fake request-level logit processor to create a
    batch-level logits processor"""

    @classmethod
    def validate_params(cls, params: SamplingParams):
        target_token: Any | None = params.extra_args and params.extra_args.get(
            "target_token"
        )
        if target_token is not None and not isinstance(target_token, int):
            raise ValueError(
                f"target_token value {target_token} is not int"
            )

    def is_argmax_invariant(self) -> bool:
        return False

    def new_req_logits_processor(
        self,
        params: SamplingParams,
    ) -> Optional[RequestLogitsProcessor]:
        """This method returns a new request-level logits processor, customized
        to the `target_token` value associated with a particular request.

        Returns None if the logits processor should not be applied to the
        particular request. To use the logits processor the request must have
        a "target_token" custom argument with an integer value.

        Args:
            params: per-request sampling params

        Returns:
            `Callable` request logits processor, or None
        """
        target_token: Any | None = params.extra_args and params.extra_args.get(
            "target_token"
        )
        if target_token is None:
            return None
        return DummyPerReqLogitsProcessor(target_token)
```
!!! note
Your `new_req_logits_processor()` override can return `None` to signal that the wrapped logits processor should not be applied to the request in question.
Once you have created a custom subclass (like `WrappedPerReqLogitsProcessor`) which wraps your request level logits processor, you can pass the custom subclass to vLLM via any of the methods described in the following section.
## Ways to Load Your Custom Logits Processor in vLLM
Logits processors are loaded at initialization. Critically, the set of loaded logits processors cannot be modified after the vLLM engine finishes loading, and new logits processors cannot be loaded on-demand for individual requests.
This section details different ways of making your logits processor visible to vLLM and triggering vLLM to load your logits processor.
### Method 1: Pass the Custom Logits Processor Fully-Qualified Class Name (FQCN) to vLLM at Initialization Time
This method is supported in both offline and online vLLM usage scenarios. The custom logits processor's FQCN (in the form of `dotted.path.to.module:ClassName`) can be passed as an argument to the `LLM` and `AsyncLLM` Python constructors, or as a CLI argument to `vllm serve` with the following syntax
``` bash
vllm serve ... --logits_processors ...
```
The only requirements on the FQCN are
1. Python's `importlib.import_module()` must be able to resolve the dotted path portion of the FQCN and load it as a module
2. The class-name portion of the FQCN must be possible to import from the loaded module
3. The object pointed to by the FQCN must be a subclass of `LogitsProcessor`
See examples below:
??? code "Passing custom logits processor FQCN to `LLM` in Python"
``` python
# Pass in FQCN
llm = LLM(
model="facebook/opt-125m",
logits_processors=["your.module.path:DummyLogitsProcessor"],
)
```
??? code "Passing custom logits processor FQCN to `AsyncLLM` in Python"
``` python
# Pass in FQCN
engine_args = AsyncEngineArgs(model="facebook/opt-125m",
logits_processors=["your.module.path:DummyLogitsProcessor"])
async_llm = AsyncLLM.from_engine_args(engine_args)
```
??? code "Passing custom logits processor FQCN to vLLM server via CLI"
```bash
vllm serve facebook/opt-125m --logits_processors your.module.path:DummyLogitsProcessor
```
### Method 2: Automatically Detect Custom Logits Processors Installed in Your Python Environment As Entry Points
[`setuptools`](https://setuptools.pypa.io/en/latest/userguide/entry_point.html) can enable installed packages to make themselves available as plugins to other Python programs, via pieces of metadata known as "entry points".
During initialization, vLLM automatically scans the `vllm.logits_processors` entry point group and loads any installed logits processors which it finds.
Suppose that you have developed a Python package that holds your custom logits processors. You can expose each logits processor to vLLM by adding a unique entrypoint for each logits processor to your logits processor Python package. The example below shows how to add an entrypoint to your project's `pyproject.toml` file:
??? code "Exposing a custom logits processor as a Python entrypoint"
``` toml
[project.entry-points."vllm.logits_processors"]
dummy_logits_processor = "your.module.path:DummyLogitsProcessor"
```
Once your package is installed, your custom logits processor will be loaded automatically whenever vLLM is initialized. You do *not* need to pass the custom logits processor to the `LLM` or `AsyncLLM` constructors or to the vLLM server explicitly at initialization time if your logits processor is exposed as an entry point.
!!! note
vLLM will *always* load *all* logits processors which are exposed via entrypoints under the `vllm.logits_processors` grouping.
### Method 3 (Offline-only): Pass a Python Class Object to the vLLM Constructor
You can pass one or more custom logits processor class objects to the `LLM` and `AsyncLLM` constructors. This option is very flexible, as the logits processor classes may either be (1) defined locally within the same Python source file where `LLM` or `AsyncLLM` is instantiated, or (2) imported from a Python package.
??? code "Passing custom logits processor class object to `LLM` or `AsyncLLM` in Python"
``` python
# Import custom logits processor
from some.module import DummyLogitsProcessor

# ...or...

# Define custom logits processor locally
from vllm.v1.sample.logits_processor import LogitsProcessor

class DummyLogitsProcessor(LogitsProcessor):
    # See DummyLogitsProcessor implementation above
    ...

# Pass class object to LLM constructor
llm = LLM(
    model="facebook/opt-125m",
    logits_processors=[DummyLogitsProcessor],
)

# Pass class object to AsyncLLM constructor
engine_args = AsyncEngineArgs(model="facebook/opt-125m",
                              logits_processors=[DummyLogitsProcessor])
async_llm = AsyncLLM.from_engine_args(engine_args)
```
## Invoking a Custom Logits Processor Against a Request
The design of the custom logits processor determines whether the logits processor must be enabled/disabled for a given request, and what arguments must be provided to configure the logits processor.
The examples below show how a user would pass a custom argument (`target_token`) to `DummyLogitsProcessor` in order to (1) enable the logits processor for that particular request and (2) control the logits processor's behavior.
??? code "vLLM REST API: configure custom logits processor for a request"
``` bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
...
"vllm_xargs": {"target_token": 67}
}'
```
??? code "OpenAI SDK: configure custom logits processor for a request"
``` python
batch = await client.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
...,
extra_body={
"vllm_xargs": {
"target_token": 67
}
}
)
```
??? code "Offline: configure custom logits processor for an `LLM` request"
``` python
outputs_logitproc = llm.generate("your prompt",
SamplingParams(...,
extra_args={"target_token": 67}))
```
??? code "Offline: configure custom logits processor for an `AsyncLLM` request"
``` python
async for out in engine.generate(request_id="your request id",
                                 prompt="your prompt",
                                 sampling_params=SamplingParams(...,
                                     extra_args={"target_token": 67})):
    # Process async request outputs
    ...
```
## Best Practices for Writing Custom Logits Processors
Once vLLM loads a logits processor during initialization, vLLM will invoke `update_state()` and `apply()` against that logits processor in every engine step. Both methods operate on all requests which currently reside in the vLLM persistent batch. Thus, it is important to implement these methods efficiently.
* Write efficient `apply()` and `update_state()` implementations in light of the fact that logits processors operate at batch granularity
* For example, you may be able to use efficient vectorized operations to implement `apply()` or update internal state vectors in `update_state()`
* However, if you think that a logits processor may be used infrequently, it may be appropriate to use a "sparse" representation of request state i.e. the class can represent request configuration using a dictionary which only stores metadata about requests that enable the logits processor
* **Note:** wrapped request-level logits processors do not need to implement `apply()` and `update_state()`; the default `AdapterLogitsProcessor.update_state()` implementation maintains a sparse representation of request state, wherein requests for which `new_req_logits_processor()` returns `None` are not represented in the base-class state dictionary. The default implementation of `AdapterLogitsProcessor.apply()` applies the request-level logits processor to each row of input logits sequentially and assembles the output logits tensor. If the performance of this `AdapterLogitsProcessor` default implementation is insufficient, then avoid wrapping your request-level logits processor and instead re-implement it as a `LogitsProcessor` subclass with optimized `apply()` and `update_state()` implementations that operate at batch granularity
* It is up to the logits processor author to determine:
1. **The per-request attributes which configure the logits processor's behavior against that request.** Your custom logits processor's `update_state()` override determines how `SamplingParams` fields are mapped into logits processor state
* **Note:** for wrapped request-level logits processors, `new_req_logits_processor()` determines how `SamplingParams` fields are used to initialize a request-level logits processor instance.
2. **The conditions under which the logits processor is or is not enabled on a per-request basis.** Unless your intention is for the custom logits processor to act on all requests all the time, you should write your logits processor in such a way that it is possible to disable the logits processor for a given request, i.e. by defaulting an argument to `None` or by passing in a specific do-nothing argument value i.e. `0.0`. Try to save compute and memory for requests which disable the logits processor
* **Note:** for wrapped per-request logits processors, the default `AdapterLogitsProcessor.update_state()` implementation ensures that the request-level logits processor is disabled when `new_req_logits_processor()` returns `None` for that request
3. **The conditions under which the logits processor is short-circuited at the batch level.** Even if you have defined a way to disable the custom logits processor at the request level, it may be difficult to translate this into compute savings i.e. if your `update_state()` and `apply()` implementations use efficient vectorized implementations that operate on the whole persistent batch in a single command. For example, you cannot skip an entire vectorized operation in `apply()` just because one request disabled the logits processor. To save compute in the edge-case where no running requests utilize the custom logits processor, we recommend designing `apply()` to return the unmodified input tensor if all requests have the logits processor disabled. Similarly, consider whether steps can be skipped in `update_state()` if no requests enable the logits processor
* Additionally, an easy way to save compute in `update_state()` is to exit early when the `batch_update` is `None`
* **Note:** for wrapped per-request logits processors, the `AdapterLogitsProcessor` base-class implements the above optimizations by default
* Ensure that the logits processor `update_state` method discards information about finished requests (i.e. requests which are replaced by an Add or which are subject to a Remove)
* **Note:** for wrapped per-request logits processors, the `AdapterLogitsProcessor` base-class handles this by default
* `is_argmax_invariant()` can be hard-coded to `True` or `False` if the logits processor has consistent behavior. However, the argmax invariance may also be determined programmatically (i.e. if your logits processor is user-customizable in some way that impacts whether the logits processor is argmax invariant). For this reason, `is_argmax_invariant()` is not a class method
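For instance, a logits processor whose behaviour is fixed by some start-up configuration might derive its argmax invariance from that configuration rather than hard-coding it. The sketch below is hypothetical (the class name, bias value and token ids are made up, and the bias is applied to every request for simplicity); it is argmax invariant exactly when the configured bias is zero, i.e. when it is effectively a no-op:
```python
import torch

from vllm.config import VllmConfig
from vllm.sampling_params import SamplingParams
from vllm.v1.sample.logits_processor import BatchUpdate, LogitsProcessor

class ConfigurableBiasLogitsProcessor(LogitsProcessor):
    """Hypothetical processor that adds a fixed bias to a set of token ids."""

    # Hypothetical module-level configuration, e.g. loaded at start-up.
    BIAS: float = -5.0
    TOKEN_IDS: tuple[int, ...] = (13, 198)

    @classmethod
    def validate_params(cls, params: SamplingParams):
        pass

    def __init__(self, vllm_config: VllmConfig, device: torch.device,
                 is_pin_memory: bool):
        self.token_ids = torch.tensor(self.TOKEN_IDS, dtype=torch.long,
                                      device=device)

    def is_argmax_invariant(self) -> bool:
        # Decided programmatically: a zero bias makes apply() a no-op and
        # therefore argmax invariant; any non-zero bias on a subset of
        # tokens may change which token has the highest logit.
        return self.BIAS == 0.0

    def update_state(self, batch_update: BatchUpdate | None) -> None:
        pass

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # In-place bias on the selected token ids for every request row.
        logits[:, self.token_ids] += self.BIAS
        return logits
```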
---
# Disaggregated Encoder
A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:
1. **Independent, fine-grained scaling**
2. **Lower time-to-first-token (TTFT)**
3. **Cross-process reuse and caching of encoder outputs**
Design doc:
---
## 1 Motivation
### 1. Independent, fine-grained scaling
* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
### 2. Lower time-to-first-token (TTFT)
* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
### 3. Cross-process reuse and caching
* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
---
## 2 Usage Example
The current reference pathway is **ExampleConnector**.
The ready-to-run scripts below show the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd_example.sh`
1 Encoder instance + 1 Prefill instance + 1 Decode instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1p1d_example.sh`
---
## 3 Test Script
Please refer to the directory `tests/v1/ec_connector`.
## 4 Development
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** – a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – run the language pre-fill and decode.
* PD can run either as a single normal instance with `disagg_encoder_example.sh` (E->PD) or as disaggregated instances with `disagg_epd_example.sh` (E->P->D)
A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.
### Key abstractions
* **ECConnector** – interface for retrieving EC caches produced by the encoder.
* *Scheduler role* – checks cache existence and schedules loads.
* *Worker role* – loads the embeddings into memory.
Here is a figure illustrating disaggregate encoder flow:

For the PD disaggregation part, the Prefill instance receives the encoder cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 output token) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens entirely after the Prefill instance has finished executing.
`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)
We create the example setup with the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and refer to `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D.
---
# Disaggregated Prefilling (experimental)
This page introduces you to the disaggregated prefilling feature in vLLM.
!!! note
This feature is experimental and subject to change.
## Why disaggregated prefilling?
Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference in different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
!!! note
Disaggregated prefill DOES NOT improve throughput.
## Usage example
Please refer to [examples/online_serving/disaggregated_prefill.sh](../../examples/online_serving/disaggregated_prefill.sh) for the example usage of disaggregated prefilling.
The following connector types are currently supported:
- **ExampleConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of ExampleConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to [examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to [examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs. For example:
```bash
--kv-transfer-config '{"kv_connector":"MultiConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"NixlConnector","kv_role":"kv_both"},{"kv_connector":"ExampleConnector","kv_role":"kv_both","kv_connector_extra_config":{"shared_storage_path":"local_storage"}}]}}'
```
For NixlConnector, you may also specify one or more NIXL backends. For example:
```bash
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_buffer_device":"cuda", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
```
- **OffloadingConnector**: enables offloading of KV data to CPU memory, with a customizable CPU block size (in tokens) and number of blocks to allocate (per worker):
```bash
--kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "num_cpu_blocks": 1000}}'
```
## Benchmarks
Please refer to [benchmarks/disagg_benchmarks](../../benchmarks/disagg_benchmarks) for disaggregated prefilling benchmarks.
## Development
We implement disaggregated prefilling by running 2 vLLM instances. One for prefill (we call it prefill instance) and one for decode (we call it decode instance), and then use a connector to transfer the prefill KV caches and results from prefill instance to decode instance.
All disaggregated prefilling implementation is under `vllm/distributed/kv_transfer`.
Key abstractions for disaggregated prefilling:
- **Connector**: Connector allows the **kv consumer** to retrieve the KV caches of a batch of requests from the **kv producer**.
- **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
!!! note
`insert` is a non-blocking operation, while `drop_select` is a blocking operation.
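To illustrate these semantics (this is a toy sketch, not the actual vLLM `LookupBuffer` class or API), an in-memory buffer might look like this: `insert` stores a KV cache under a key and returns immediately, while `drop_select` blocks until a matching entry exists and then removes and returns it:
```python
import threading
from typing import Any

class ToyLookupBuffer:
    """Illustrative sketch only; not the vLLM LookupBuffer implementation."""

    def __init__(self) -> None:
        self._store: dict[Any, Any] = {}
        self._cv = threading.Condition()

    def insert(self, key: Any, kv_cache: Any) -> None:
        # Non-blocking: stash the KV cache and wake up waiting consumers.
        with self._cv:
            self._store[key] = kv_cache
            self._cv.notify_all()

    def drop_select(self, key: Any) -> Any:
        # Blocking: wait until a matching KV cache exists, then remove and
        # return it (similar to an SQL SELECT followed by a DELETE).
        with self._cv:
            self._cv.wait_for(lambda: key in self._store)
            return self._store.pop(key)
```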
Here is a figure illustrating how the above 3 abstractions are organized:

The workflow of disaggregated prefilling is as follows:

The `buffer` corresponds to the `insert` API in LookupBuffer, and `drop_select` corresponds to the `drop_select` API in LookupBuffer.
Now every process in vLLM will have a corresponding connector. Specifically, we have:
- Scheduler connector: the connector located in the same process as the scheduler. It schedules the KV cache transfer ops.
- Worker connectors: the connectors located in the worker processes. They execute the KV cache transfer ops.
Here is a figure illustrating how the above 2 connectors are organized:

The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:

## Third-party contributions
Disaggregated prefilling is highly related to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and vLLM team will actively review and merge new PRs for third-party connectors).
We recommend three ways of implementations:
- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and much more (like editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
---
# Interleaved Thinking
## Introduction
Interleaved thinking allows models to reason between tool calls, enabling more sophisticated decision-making after receiving tool results. This feature helps models chain multiple tool calls with reasoning steps in between and make nuanced decisions based on intermediate results.
Important: Interleaved thinking increases token usage and response latency. Consider your budget and performance requirements when enabling this feature.
## How Interleaved Thinking Works
With interleaved thinking, the model can:
- Reason about the results of a tool call before deciding what to do next
- Chain multiple tool calls with reasoning steps in between
- Make more nuanced decisions based on intermediate results
- Provide transparent reasoning for its tool selection process
## Supported Models
vLLM currently supports the following interleaved thinking models:
| Model Series | Reasoning Parser Name |
|--------------|-----------------------|
| moonshotai/Kimi-K2-Thinking | kimi_k2 |
| MiniMaxAI/MiniMax-M2 | minimax_m2 |
## Example Usage
To use interleaved thinking with tool calls, specify a model that supports this feature and enable tool calls in your chat completion request. Here's an example:
??? code
```python
"""
vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice
"""
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
def get_current_weather(location: str, unit: "str"):
"""Get the current weather in a given location"""
if unit == "celsius":
return f"The current temperature in {location} is 22°C."
else:
return f"The current temperature in {location} is 72°F."
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g., 'San Francisco, CA'",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location", "unit"],
},
},
}
]
messages = [{"role": "user", "content": "What's the weather in Fahrenheit like in San Francisco?"}]
response = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=messages,
tools=tools,
tool_choice="auto",
)
tool_call = response.choices[0].message.tool_calls[0].function
messages.append(
{
"role": "assistant",
"tool_calls": response.choices[0].message.tool_calls,
"reasoning": response.choices[0].message.reasoning, # append reasoning
}
)
# Simulate tool execution
available_tools = {"get_weather": get_current_weather}
completion_tool_calls = response.choices[0].message.tool_calls
for call in completion_tool_calls:
tool_to_call = available_tools[call.function.name]
args = json.loads(call.function.arguments)
result = tool_to_call(**args)
messages.append(
{
"role": "tool",
"content": result,
"tool_call_id": call.id,
"name": call.function.name,
}
)
response_2 = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=messages,
tools=tools,
tool_choice="auto",
)
print(response_2.choices[0].message.content)
```
This example demonstrates how to set up interleaved thinking with tool calls using a weather retrieval function. The model reasons about the tool results before generating the final response.
---
# LoRA Adapters
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save
them locally with
```python
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
```
Then we instantiate the base model and pass in the `enable_lora=True` flag:
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
```
We can now submit the prompts and call `llm.generate` with the `lora_request` parameter. The first parameter
of `LoRARequest` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
the third parameter is the path to the LoRA adapter.
??? code
```python
sampling_params = SamplingParams(
temperature=0,
max_tokens=256,
stop=["[/assistant]"],
)
prompts = [
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)
```
Check out [examples/offline_inference/multilora_inference.py](../../examples/offline_inference/multilora_inference.py) for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters
LoRA-adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
`--lora-modules {name}={path} {name}={path}` to specify each LoRA module when we kick off the server:
```bash
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
!!! note
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
??? console "Command"
```bash
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
]
}
```
Requests can specify the LoRA adapter as if it were any other model via the `model` request parameter. The requests will be
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and `max_loras` is set high enough).
The following is an example request
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sql-lora",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}' | jq
```
## Dynamically serving LoRA Adapters
In addition to serving LoRA adapters at server startup, the vLLM server supports dynamically configuring LoRA adapters at runtime through dedicated API endpoints and plugins. This feature can be particularly useful when the flexibility to change models on-the-fly is needed.
Note: Enabling this feature in production environments is risky, as it allows users to load and unload model adapters at runtime.
To enable dynamic LoRA configuration, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
is set to `True`.
```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
```
### Using API Endpoints
Loading a LoRA Adapter:
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
Example request to load a LoRA adapter:
```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter",
"lora_path": "/path/to/sql-lora-adapter"
}'
```
Upon a successful request, the API responds with a `200 OK` status code, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' added successfully`. If an error occurs, such as the adapter not being found or failing to load, an appropriate error message is returned.
Unloading a LoRA Adapter:
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.
Upon a successful request, the API responds with a `200 OK` status code from `vllm serve`, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' removed successfully`.
Example request to unload a LoRA adapter:
```bash
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter"
}'
```
### Using Plugins
Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.
You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.
You can either install existing plugins or implement your own. By default, vLLM comes with a [resolver plugin to load LoRA adapters from a local directory.](https://github.com/vllm-project/vllm/tree/main/vllm/plugins/lora_resolvers)
To enable this resolver, set `VLLM_ALLOW_RUNTIME_LORA_UPDATING` to True, set `VLLM_PLUGINS` to include `lora_filesystem_resolver`, and then set `VLLM_LORA_RESOLVER_CACHE_DIR` to a local directory. When vLLM receives a request using a LoRA adapter `foobar`,
it will first look in the local directory for a directory `foobar`, and attempt to load the contents of that directory as a LoRA adapter. If successful, the request will complete as normal and
that adapter will then be available for normal use on the server.
Alternatively, follow these example steps to implement your own plugin:
1. Implement the LoRAResolver interface.
??? code "Example of a simple S3 LoRAResolver implementation"
```python
import os

import s3fs

from vllm.lora.request import LoRARequest
from vllm.lora.resolver import LoRAResolver

class S3LoRAResolver(LoRAResolver):
    def __init__(self):
        self.s3 = s3fs.S3FileSystem()
        self.s3_path_format = os.getenv("S3_PATH_TEMPLATE")
        self.local_path_format = os.getenv("LOCAL_PATH_TEMPLATE")

    async def resolve_lora(self, base_model_name, lora_name):
        s3_path = self.s3_path_format.format(base_model_name=base_model_name, lora_name=lora_name)
        local_path = self.local_path_format.format(base_model_name=base_model_name, lora_name=lora_name)

        # Download the LoRA from S3 to the local path
        await self.s3._get(
            s3_path, local_path, recursive=True, maxdepth=1
        )

        lora_request = LoRARequest(
            lora_name=lora_name,
            lora_path=local_path,
            lora_int_id=abs(hash(lora_name)),
        )

        return lora_request
```
2. Register `LoRAResolver` plugin.
```python
from vllm.lora.resolver import LoRAResolverRegistry
s3_resolver = S3LoRAResolver()
LoRAResolverRegistry.register_resolver("s3_resolver", s3_resolver)
```
For more details, refer to the [vLLM's Plugins System](../design/plugin_system.md).
## New format for `--lora-modules`
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
```bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
This format only included the `name` and `path` for each LoRA module and did not provide a way to specify a `base_model_name`.
Now, you can specify a `base_model_name` alongside the `name` and `path` using JSON format. For example:
```bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
```
For backward compatibility, you can still use the old key-value format (`name=path`), but the `base_model_name` will remain unspecified in that case.
## LoRA model lineage in model card
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's how the `/models` response supports this:
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the LoRA adapter.
??? console "Command output"
```bash
$ curl http://localhost:8000/v1/models
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
"parent": null,
"permission": [
{
.....
}
]
},
{
"id": "sql-lora",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
"parent": meta-llama/Llama-2-7b-hf,
"permission": [
{
....
}
]
}
]
}
```
## LoRA Support for Tower and Connector of Multi-Modal Model
Currently, vLLM experimentally supports LoRA for the Tower and Connector components of multi-modal models. To enable this feature, you need to implement the corresponding token helper functions for the tower and connector. For more details on the rationale behind this approach, please refer to [PR 26674](https://github.com/vllm-project/vllm/pull/26674). We welcome contributions to extend LoRA support to additional models' tower and connector.
## Default LoRA Models For Multimodal Models
Some models, e.g., [Granite Speech](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) multimodal, contain LoRA adapter(s) that are expected to always be applied when a given modality is present. This can be a bit tedious to manage with the above approaches, as it requires the user to send the `LoRARequest` (offline) or to filter requests between the base model and LoRA model (server) depending on the content of the request's multimodal data.
To this end, we allow registration of default multimodal LoRAs to handle this automatically, where users can map each modality to a LoRA adapter to automatically apply it when the corresponding inputs are present. Note that currently, we only allow one LoRA per prompt; if several modalities are provided, each of which is registered to a default LoRA, none of the LoRAs will be applied.
??? code "Example usage for offline inference"
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
model_id = "ibm-granite/granite-speech-3.3-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {"role": "user", "content": question},
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)
llm = LLM(
model=model_id,
enable_lora=True,
max_lora_rank=64,
max_model_len=2048,
limit_mm_per_prompt={"audio": 1},
# Will always pass a `LoRARequest` with the `model_id`
# whenever audio is contained in the request data.
default_mm_loras = {"audio": model_id},
enforce_eager=True,
)
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
question=question,
has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate
inputs = {
"prompt": prompt_with_audio,
"multi_modal_data": {
"audio": audio,
}
}
outputs = llm.generate(
inputs,
sampling_params=SamplingParams(
temperature=0.2,
max_tokens=64,
),
)
```
You can also pass a json dictionary of `--default-mm-loras` mapping modalities to LoRA model IDs. For example, when starting the server:
```bash
vllm serve ibm-granite/granite-speech-3.3-2b \
--max-model-len 2048 \
--enable-lora \
--default-mm-loras '{"audio":"ibm-granite/granite-speech-3.3-2b"}' \
--max-lora-rank 64
```
Note: Default multimodal LoRAs are currently only available for `.generate` and chat completions.
## Using Tips
### Configuring `max_lora_rank`
The `--max-lora-rank` parameter controls the maximum rank allowed for LoRA adapters. This setting affects memory allocation and performance:
- **Set it to the maximum rank** among all LoRA adapters you plan to use
- **Avoid setting it too high** - using a value much larger than needed wastes memory and can cause performance issues
For example, if your LoRA adapters have ranks [16, 32, 64], use `--max-lora-rank 64` rather than 256
```bash
# Good: matches actual maximum rank
vllm serve model --enable-lora --max-lora-rank 64
# Bad: unnecessarily high, wastes memory
vllm serve model --enable-lora --max-lora-rank 256
```
---
# MooncakeConnector Usage Guide
## About Mooncake
Mooncake aims to enhance the inference efficiency of large language models (LLMs), especially in slow object storage environments, by constructing a multi-level caching pool on high-speed interconnected DRAM/SSD resources. Compared to traditional caching systems, Mooncake utilizes (GPUDirect) RDMA technology to transfer data directly in a zero-copy manner, while maximizing the use of multi-NIC resources on a single machine.
For more details about Mooncake, please refer to [Mooncake project](https://github.com/kvcache-ai/Mooncake) and [Mooncake documents](https://kvcache-ai.github.io/Mooncake/).
## Prerequisites
### Installation
Install mooncake through pip: `uv pip install mooncake-transfer-engine`.
Refer to [Mooncake official repository](https://github.com/kvcache-ai/Mooncake) for more installation instructions
## Usage
### Prefiller Node (192.168.0.2)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```
### Decoder Node (192.168.0.3)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8020 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
### Proxy
```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-host 192.168.0.2 --prefiller-port 8010 --decoder-host 192.168.0.3 --decoder-port 8020
```
> NOTE: The Mooncake Connector currently uses the proxy from nixl_integration. This will be replaced with a self-developed proxy in the future.
Now you can send requests to the proxy server through port 8000.
## Environment Variables
- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for Mooncake bootstrap server
- Default: 8998
- Required only for prefiller instances
- Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
- For TP/DP deployments, each worker's port on a node is computed as `base_port + dp_rank * tp_size + tp_rank`; for example, with `tp_size=2`, the worker with `dp_rank=1` and `tp_rank=0` listens on `8998 + 1 * 2 + 0 = 9000` (see the launch sketch after this list)
- Used by the decoder to notify the prefiller
- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
- Default: 480
- If a request is aborted and the decoder has not yet notified the prefiller, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
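For example, a prefiller can be launched with a non-default bootstrap port and a shorter abort timeout (the specific values are illustrative):
```bash
VLLM_MOONCAKE_BOOTSTRAP_PORT=9100 \
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT=300 \
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```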
## KV Role Options
- **kv_producer**: For prefiller instances that generate KV caches
- **kv_consumer**: For decoder instances that consume KV caches from prefiller
- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined (see the sketch after this list).
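For example, a `kv_both` instance is launched the same way as the prefiller and decoder above, differing only in the role (the port is illustrative):
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8030 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}'
```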
---
# Multimodal Inputs
This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-multimodal-language-models) in vLLM.
!!! note
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
!!! tip
When serving multi-modal models, consider setting `--allowed-media-domains` to restrict the domains that vLLM can access, preventing it from being used to fetch media from arbitrary endpoints in a Server-Side Request Forgery (SSRF) attack. You can provide a list of domains for this arg. For example: `--allowed-media-domains upload.wikimedia.org github.com www.bogotobogo.com`
Also, consider setting `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to prevent HTTP redirects from being followed to bypass domain restrictions.
This restriction is especially important if you run vLLM in a containerized environment where the vLLM pods may have unrestricted access to internal networks. A serve command combining these settings is sketched after this tip.
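Combining the two settings above, a minimal serve sketch (the model and domain list are illustrative):
```bash
VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 \
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
  --allowed-media-domains upload.wikimedia.org github.com
```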
## Offline Inference
To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
### Stable UUIDs for Caching (multi_modal_uuids)
When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
??? code
```python
from vllm import LLM
from PIL import Image

# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")

prompt = "USER: \nDescribe the differences.\nASSISTANT:"
img_a = Image.open("/path/to/a.jpg")
img_b = Image.open("/path/to/b.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": [img_a, img_b]},
    # Provide stable IDs for caching.
    # Requirements (matched by this example):
    # - Include every modality present in multi_modal_data.
    # - For lists, provide the same number of entries.
    # - Use None to fall back to content hashing for that item.
    "multi_modal_uuids": {"image": ["sku-1234-a", None]},
})

for o in outputs:
    print(o.outputs[0].text)
```
Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
??? code
```python
from vllm import LLM
from PIL import Image

# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")

prompt = "USER: \nDescribe the differences.\nASSISTANT:"
img_b = Image.open("/path/to/b.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": [None, img_b]},
    # Since img_a is expected to be cached, we can skip sending the actual
    # image entirely.
    "multi_modal_uuids": {"image": ["sku-1234-a", None]},
})

for o in outputs:
    print(o.outputs[0].text)
```
!!! warning
If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
### Image Inputs
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
??? code
```python
import PIL.Image

from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = PIL.Image.open(...)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)

outputs = llm.generate(
    [
        {
            "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_1},
        },
        {
            "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_2},
        },
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
??? code
```python
import PIL.Image

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,  # Required to load Phi-3.5-vision
    max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
    limit_mm_per_prompt={"image": 2},  # The maximum number to accept
)

# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": [image1, image2]},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
Full example: [examples/offline_inference/vision_language_multi_image.py](../../examples/offline_inference/vision_language_multi_image.py)
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
```python
import torch

from vllm import LLM
from vllm.assets.image import ImageAsset

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image_url = "https://picsum.photos/id/32/512/512"
image_pil = ImageAsset("cherry_blossom").pil_image
image_embeds = torch.load(...)

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": image_url},
            },
            {
                "type": "image_pil",
                "image_pil": image_pil,
            },
            {
                "type": "image_embeds",
                "image_embeds": image_embeds,
            },
            {
                "type": "text",
                "text": "What's in these images?",
            },
        ],
    },
]

# Perform inference and log output.
outputs = llm.chat(conversation)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
??? code
```python
from vllm import LLM

# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

# Create the request payload.
video_frames = ...  # load your video making sure it only has the number of frames specified earlier.
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe this set of frames. Consider the frames to be a part of the same video.",
        },
    ],
}
for i in range(len(video_frames)):
    base64_image = encode_image(video_frames[i])  # base64 encoding (user-provided helper).
    new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
    message["content"].append(new_image)

# Perform inference and log output.
outputs = llm.chat([message])

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
#### Custom RGBA Background Color
When loading RGBA images (images with transparency), vLLM converts them to RGB format. By default, transparent pixels are replaced with a white background. You can customize this background color using the `rgba_background_color` parameter in `media_io_kwargs`.
??? code
```python
from vllm import LLM

# Default white background (no configuration needed)
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Custom black background for dark theme
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    media_io_kwargs={"image": {"rgba_background_color": [0, 0, 0]}},
)

# Custom brand color background (e.g., blue)
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    media_io_kwargs={"image": {"rgba_background_color": [0, 0, 255]}},
)
```
!!! note
- The `rgba_background_color` accepts RGB values as a list `[R, G, B]` or tuple `(R, G, B)` where each value is 0-255
- This setting only affects RGBA images with transparency; RGB images are unchanged
- If not specified, the default white background `(255, 255, 255)` is used for backward compatibility
### Video Inputs
You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.
You can also pass `torch.Tensor` instances instead of NumPy arrays, as shown in this example using Qwen2.5-VL:
??? code
```python
from transformers import AutoProcessor

from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
video_path = "https://content.pexels.com/videos/free-videos.mp4"

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.8,
    enforce_eager=True,
    limit_mm_per_prompt={"video": 1},
)

sampling_params = SamplingParams(max_tokens=1024)

video_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "describe this video."},
            {
                "type": "video",
                "video": video_path,
                "total_pixels": 20480 * 28 * 28,
                "min_pixels": 16 * 28 * 28,
            },
        ],
    },
]

messages = video_messages
processor = AutoProcessor.from_pretrained(model_path)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs = process_vision_info(messages)
mm_data = {}
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
!!! note
`process_vision_info` is only applicable to Qwen2.5-VL and similar models.
Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
### Audio Inputs
You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
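A minimal sketch using the built-in audio asset (the model and prompt template are illustrative; check the model's HuggingFace repo for the exact prompt format it expects):
```python
from vllm import LLM
from vllm.assets.audio import AudioAsset

llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b", limit_mm_per_prompt={"audio": 1})

# (array, sampling_rate) tuple
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

outputs = llm.generate({
    "prompt": "<|audio|>\nTranscribe the audio clip.",  # illustrative prompt format
    "multi_modal_data": {"audio": audio},
})

for o in outputs:
    print(o.outputs[0].text)
```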
Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)
### Embedding Inputs
To input pre-computed embeddings for a given modality (image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
You must enable this feature via `enable_mm_embeds=True`.
!!! warning
The vLLM engine may crash if embeddings with an incorrect shape are passed.
Only enable this flag for trusted users!
#### Image Embeddings
??? code
```python
import torch

from vllm import LLM

# Inference with image embeddings as input
llm = LLM(model="llava-hf/llava-1.5-7b-hf", enable_mm_embeds=True)

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
??? code
```python
import torch

from vllm import LLM

# Construct the prompt based on your model
prompt = ...

# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

# Qwen2-VL
llm = LLM(
    "Qwen/Qwen2-VL-2B-Instruct",
    limit_mm_per_prompt={"image": 4},
    enable_mm_embeds=True,
)
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_grid_thw is needed to calculate positional encoding.
        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3),
    }
}

# MiniCPM-V
llm = LLM(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 4},
    enable_mm_embeds=True,
)
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_sizes is needed to calculate details of the sliced image.
        # `images` is the list of PIL images the embeddings were computed from.
        "image_sizes": [image.size for image in images],  # list of image sizes
    }
}

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": mm_data,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
#### Audio Embedding Inputs
You can pass pre-computed audio embeddings similar to image embeddings:
??? code
```python
from vllm import LLM
import torch
# Enable audio embeddings support
llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b", enable_mm_embeds=True)
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: