# TensorRT LLM

> Reusable note sections for docs.

## Pages

- [Note_Sections](includes-note-sections.md): ..
- [trtllm-bench](commands-trtllm-bench.md): trtllm-bench
- [trtllm-build](commands-trtllm-build.md): trtllm-build
- [trtllm-eval](commands-trtllm-eval.md): trtllm-eval
- [trtllm-serve](commands-trtllm-serve.md): trtllm-serve
- [trtllm-serve](commands-trtllm-serve-trtllm-serve.md): trtllm-serve
- [Config_Table](deployment-guide-config-table.md): .. start-config-table-note
- [Model Recipes](deployment-guide.md): Model Recipes
- [Dynamo K8s Example](examples-dynamo-k8s-example.md): Dynamo K8s Example
- [Index](examples.md): =======================================================
- [Index](index.md): .. TensorRT LLM documentation master file, created by
- [Index](installation.md): .. _installation:
- [Index](legacy-performance-performance-tuning-guide.md): Performance Tuning Guide
- [Functionals](legacy-python-api-tensorrt-llmfunctional.md): Functionals
- [Layers](legacy-python-api-tensorrt-llmlayers.md): Layers
- [Models](legacy-python-api-tensorrt-llmmodels.md): Models
- [Plugin](legacy-python-api-tensorrt-llmplugin.md): Plugin
- [Quantization](legacy-python-api-tensorrt-llmquantization.md): Quantization
- [Runtime](legacy-python-api-tensorrt-llmruntime.md): Runtime
- [How to get best performance on DeepSeek-R1 in TensorRT LLM](blogs-best-perf-practice-on-deepseek-r1-in-tensorrt-llm.md): NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system wi...
- [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](blogs-falcon180b-h200.md): H200's large capacity & high memory bandwidth, paired with TensorRT LLM's
- [H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token](blogs-h100vsa100.md): :bangbang: :new: *NVIDIA H200 has been announced & is optimized on TensorRT LLM. Learn more about H200, & H100 compar...
- [H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM](blogs-h200launch.md): :loudspeaker: Note: The below data is using TensorRT LLM v0.5. There have been significant improvements in v0.6 & lat...
- [New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](blogs-xqa-kernel.md): XQA kernel provides optimization for [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/abs/2305.132...
- [Speed up inference with SOTA quantization techniques in TRT-LLM](blogs-quantization-in-trt-llm.md): The deployment and inference speed of LLMs are often impeded by limitations in memory capacity, memory bandwidth, and...
- [ADP Balance Strategy](blogs-tech-blog-blog10-adp-balance-strategy.md): By NVIDIA TensorRT LLM team
- [Create an API key at https://ngc.nvidia.com (if you don't have one)](blogs-tech-blog-blog11-gpt-oss-eagle3.md): This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It ...
- [Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly](blogs-tech-blog-blog12-combining-guided-decoding-and-speculative-decoding.md): *By NVIDIA TensorRT LLM Team and the XGrammar Team*
- [Inference Time Compute Implementation in TensorRT LLM](blogs-tech-blog-blog13-inference-time-compute-implementation-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team and UCSD Hao AI Lab
- [Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)](blogs-tech-blog-blog14-scaling-expert-parallelism-in-tensorrt-llm-part3.md): This blog post is a continuation of previous posts:
- [Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs](blogs-tech-blog-blog1-pushing-latency-boundaries-optimizing-deepseek-r1-performa.md): by NVIDIA TensorRT LLM team
- [DeepSeek R1 MTP Implementation and Optimization](blogs-tech-blog-blog2-deepseek-r1-mtp-implementation-and-optimization.md): by NVIDIA TensorRT LLM team
- [Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers](blogs-tech-blog-blog3-optimizing-deepseek-r1-throughput-on-nvidia-blackwell-gpus.md): By NVIDIA TensorRT LLM team
- [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)](blogs-tech-blog-blog4-scaling-expert-parallelism-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team
- [Disaggregated Serving in TensorRT LLM](blogs-tech-blog-blog5-disaggregated-serving-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team
- [How to launch Llama4 Maverick + Eagle3 TensorRT LLM server](blogs-tech-blog-blog6-llama4-maverick-eagle-guide.md): Artificial Analysis has benchmarked the Llama4 Maverick with Eagle3 enabled TensorRT LLM server running at over [1000...
- [N-Gram Speculative Decoding in TensorRT LLM](blogs-tech-blog-blog7-ngram-performance-analysis-and-auto-enablement.md): N-Gram speculative decoding leverages the natural repetition in many LLM workloads. It splits previously seen text in...
- [Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)](blogs-tech-blog-blog8-scaling-expert-parallelism-in-tensorrt-llm-part2.md): This blog post continues our previous work on [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Impleme...
- [Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM](blogs-tech-blog-blog9-deploying-gpt-oss-on-trtllm.md): In the guide below, we will walk you through how to launch your own
- [Run benchmarking with `trtllm-serve`](commands-trtllm-serve-run-benchmark-with-trtllm-serve.md): TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command.
- [Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-deepseek-r1-on-trtllm.md): This deployment guide provides step-by-step instructions for running the DeepSeek R1 model using TensorRT LLM with FP...
- [Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware](deployment-guide-deployment-guide-for-gpt-oss-on-trtllm.md): This deployment guide provides step-by-step instructions for running the GPT-OSS model using TensorRT-LLM, optimized ...
- [Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell](deployment-guide-deployment-guide-for-kimi-k2-thinking-on-trtllm.md): This is a quickstart guide for running the Kimi K2 Thinking model on TensorRT LLM. It focuses on a working setup with...
- [Deployment Guide for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-llama33-70b-on-trtllm.md): This deployment guide provides step-by-step instructions for running the Llama 3.3-70B Instruct model using TensorRT ...
- [Deployment Guide for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-llama4-scout-on-trtllm.md): This deployment guide provides step-by-step instructions for running the Llama-4-Scout-17B-16E-Instruct model using T...
- [Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-qwen3-next-on-trtllm.md): This is a functional quick-start guide for running the Qwen3-Next model on TensorRT LLM. It focuses on a working setu...
- [Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-qwen3-on-trtllm.md): This is a functional quick-start guide for running the Qwen3 model on TensorRT LLM. It focuses on a working setup wit...
- [LLM API Change Guide](developer-guide-api-change.md): This guide explains how to modify and manage APIs in TensorRT LLM, focusing on the high-level LLM API.
- [Continuous Integration Overview](developer-guide-ci-overview.md): This page explains how TensorRT‑LLM's CI is organized and how individual tests map to Jenkins stages. Most stages exe...
- [Using Dev Containers](developer-guide-dev-containers.md): The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
- [Introduction to KV Cache Transmission](developer-guide-kv-transfer.md): This article provides a general overview of the components used for device-to-device transmission of KV cache, which ...
- [Architecture Overview](developer-guide-overview.md): The `LLM` class is a core entry point for TensorRT LLM, providing a simplified `generate()` API for efficient lar...
- [Performance Analysis](developer-guide-perf-analysis.md): (perf-analysis)=
- [TensorRT LLM Benchmarking](developer-guide-perf-benchmarking.md): (perf-benchmarking)=
- [Overview](developer-guide-perf-overview.md): (perf-overview)=
- [LLM Common Customizations](examples-customization.md): TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instanc...
- [How to Change KV Cache Behavior](examples-kvcacheconfig.md): Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine (a minimal sketch follows this list). Consid...
- [How to Change Block Priorities](examples-kvcacheretentionconfig.md): You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a re...
- [Additional Outputs](features-additional-outputs.md): (additional-outputs)=
- [Multi-Head, Multi-Query, and Group-Query Attention](features-attention.md): (attention)=
- [Benchmarking with trtllm-bench](features-auto-deploy-advanced-benchmarking-with-trtllm-bench.md): AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehens...
- [Example Run Script](features-auto-deploy-advanced-example-run.md): To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
- [Expert Configuration of LLM API](features-auto-deploy-advanced-expert-configurations.md): For advanced TensorRT LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use a...
- [Logging Level](features-auto-deploy-advanced-logging.md): Use the following env variable to specify the logging level of our built-in logger, ordered by
- [Construct the LLM high-level interface object with autodeploy as backend](features-auto-deploy-advanced-workflow.md): AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section prov...
- [AutoDeploy (Prototype)](features-auto-deploy-auto-deploy.md): This project is under active development and is currently in a prototype stage. The code is a prototype, subject to c...
- [Support_Matrix](features-auto-deploy-support-matrix.md): AutoDeploy streamlines model deployment with an automated workflow designed for efficiency and performance. The workf...
- [Checkpoint Loading](features-checkpoint-loading.md): The PyTorch backend provides a flexible and extensible infrastructure for loading model checkpoints from different fo...
- [Disaggregated Serving](features-disagg-serving.md): - Motivation
- [Feature Combination Matrix](features-feature-combination-matrix.md): | Feature | Overlap Scheduler | CUDA Graph | Attention Data Parallelism | Disaggregated Serving | ...
- [Guided Decoding](features-guided-decoding.md): Guided decoding (or interchangeably constrained decoding, structured generation) guarantees that the LLM outputs are ...
- [Helix Parallelism](features-helix.md): Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional at...
- [KV Cache Connector](features-kv-cache-connector.md): The KV Cache Connector is a flexible interface in TensorRT-LLM that enables remote or external access to the Key-Valu...
- [KV Cache System](features-kvcache.md): The KV cache stores previously computed key-value pairs for reuse during generation in order to avoid redundant calcu...
- [Long Sequences](features-long-sequence.md): In many real-world scenarios, such as long document summarization or multi-turn conversations, LLMs are required to ...
- [LoRA (Low-Rank Adaptation)](features-lora.md): LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that enables adapting large language models...
- [Multimodal Support in TensorRT LLM](features-multi-modality.md): TensorRT LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text.
- [Overlap Scheduler](features-overlap-scheduler.md): To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating respon...
- [Paged Attention, IFB, and Request Scheduling](features-paged-attention-ifb-scheduler.md): TensorRT LLM supports in-flight batching of requests (also known as continuous
- [Parallelism in TensorRT LLM](features-parallel-strategy.md): Parallelism across multiple GPUs becomes necessary when either
- [Quantization](features-quantization.md): Quantization is a technique used to reduce memory footprint and computational cost by converting the model's weights ...
- [Ray Orchestrator (Prototype)](features-ray-orchestrator.md): This project is under active development and currently in a prototype stage. The current focus is on core functionali...
- [Sampling](features-sampling.md): The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature...
- [Sparse Attention](features-sparse-attention.md): - Background and Motivation
- [Speculative Decoding](features-speculative-decoding.md): There are two flavors of speculative decoding currently supported in the PyTorch backend:
- [Torch Compile & Piecewise CUDA Graph](features-torch-compile-and-piecewise-cuda-graph.md): In this guide, we show how to enable torch.compile and Piecewise CUDA Graph in TensorRT LLM. TensorRT LLM uses torch....
- [Building from Source Code on Linux](installation-build-from-source-linux.md): (build-from-source-linux)=
- [Pre-built release container images on NGC](installation-containers.md): (containers)=
- [Installing on Linux via `pip`](installation-linux.md): (linux)=
- [Disaggregated-Service (Prototype)](legacy-advanced-disaggregated-service.md): (disaggregated-service)=
- [Executor API](legacy-advanced-executor.md): (executor)=
- [Expert Parallelism in TensorRT-LLM](legacy-advanced-expert-parallelism.md): (expert-parallelism)=
- [Multi-Head, Multi-Query, and Group-Query Attention](legacy-advanced-gpt-attention.md): (gpt-attention)=
- [C++ GPT Runtime](legacy-advanced-gpt-runtime.md): (gpt-runtime)=
- [Graph Rewriting Module](legacy-advanced-graph-rewriting.md): (graph-rewriting)=
- [KV Cache Management: Pools, Blocks, and Events](legacy-advanced-kv-cache-management.md): (kv-cache-management)=
- [KV cache reuse](legacy-advanced-kv-cache-reuse.md): (kv-cache-reuse)=
- [loraConfig](legacy-advanced-lora.md): (lora)=
- [Low-Precision-AllReduce](legacy-advanced-lowprecision-pcie-allreduce.md): Note:
- [Open Sourced Cutlass Kernels](legacy-advanced-open-sourced-cutlass-kernels.md): We have recently open-sourced a set of Cutlass kernels that were previously known as "internal_cutlass_kernels". Due ...
- [Speculative Sampling](legacy-advanced-speculative-decoding.md): - About Speculative Sampling
- [Convert model as normal. Assume hugging face model is in llama-7b-hf/](legacy-advanced-weight-streaming.md): (weight-streaming)=
- [Adding a Model](legacy-architecture-add-model.md): (add-model)=
- [TensorRT LLM Checkpoint](legacy-architecture-checkpoint.md): The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those vers...
- [Model Definition](legacy-architecture-core-concepts.md): (core-concepts)=
- [TensorRT-LLM Model Weights Loader](legacy-architecture-model-weights-loader.md): The weights loader is designed for easily converting and loading external weight checkpoints into TensorRT-LLM models.
- [TensorRT-LLM Build Workflow](legacy-architecture-workflow.md): The build workflow contains two major steps.
- [Build the TensorRT LLM Docker Image](legacy-dev-on-cloud-build-image-to-dockerhub.md): (build-image-to-dockerhub)=
- [Develop TensorRT LLM on Runpod](legacy-dev-on-cloud-dev-on-runpod.md): (dev-on-runpod)=
- [Key Features](legacy-key-features.md): This document lists key features supported in TensorRT-LLM.
- [Performance Analysis](legacy-performance-perf-analysis.md): (perf-analysis)=
- [TensorRT-LLM Benchmarking](legacy-performance-perf-benchmarking.md): (perf-benchmarking)=
- [Benchmarking Default Performance](legacy-performance-performance-tuning-guide-benchmarking-default-performance.md): (benchmarking-default-performance)=
- [Deciding Model Sharding Strategy](legacy-performance-performance-tuning-guide-deciding-model-sharding-strategy.md): (deciding-model-sharding-strategy)=
- [FP8 Quantization](legacy-performance-performance-tuning-guide-fp8-quantization.md): (fp8-quantization)=
- [Introduction](legacy-performance-performance-tuning-guide-introduction.md): While defaults are expected to provide solid performance, TensorRT-LLM has several configurable options that can impr...
- [Tuning Max Batch Size and Max Num Tokens](legacy-performance-performance-tuning-guide-tuning-max-batch-size-and-max-num-to.md): (tuning-max-batch-size-and-max-num-tokens)=
- [Useful Build-Time Flags](legacy-performance-performance-tuning-guide-useful-build-time-flags.md): (useful-build-time-flags)=
- [Useful Runtime Options](legacy-performance-performance-tuning-guide-useful-runtime-flags.md): (useful-runtime-flags)=
- [Memory Usage of TensorRT-LLM](legacy-reference-memory.md): (memory)=
- [Multimodal Feature Support Matrix (PyTorch Backend)](legacy-reference-multimodal-feature-support-matrix.md): | Model | CUDA Graph | Encoder IFB | KV Cache Reuse | Chunked Prefill |
- [Numerical Precision](legacy-reference-precision.md): (precision)=
- [Support Matrix](legacy-reference-support-matrix.md): TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide...
- [Troubleshooting](legacy-reference-troubleshooting.md): (troubleshooting)=
- [LLM API with TensorRT Engine](legacy-tensorrt-quickstart.md): A simple inference example with TinyLlama using the LLM API:
- [PyTorch Backend](legacy-torch.md): Note:
- [LLM API Introduction](llm-api.md): The LLM API is a high-level Python API designed to streamline LLM inference workflows.
- [Adding a New Model](models-adding-new-model.md): 1. Introduction
- [Supported Models](models-supported-models.md): (support-matrix)=
- [Overview](overview.md): (product-overview)=
- [Quick Start Guide](quick-start-guide.md): (quick-start-guide)=
- [Release Notes](release-notes.md): (release-notes)=
- [Adding a New Model in PyTorch Backend](torch-adding-new-model.md): 1. Introduction
- [Architecture Overview](torch-arch-overview.md): TensorRT LLM is a toolkit designed to create optimized solutions for Large Language Model (LLM) inference.
- [Attention](torch-attention.md): (attention)=
- [Benchmarking with trtllm-bench](torch-auto-deploy-advanced-benchmarking-with-trtllm-bench.md): AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehens...
- [Example Run Script](torch-auto-deploy-advanced-example-run.md): To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
- [Expert Configuration of LLM API](torch-auto-deploy-advanced-expert-configurations.md): For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use a...
- [Logging Level](torch-auto-deploy-advanced-logging.md): Use the following env variable to specify the logging level of our built-in logger, ordered by
- [Serving with trtllm-serve](torch-auto-deploy-advanced-serving-with-trtllm-serve.md): AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models ove...
- [Construct the LLM high-level interface object with autodeploy as backend](torch-auto-deploy-advanced-workflow.md): AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section prov...
- [AutoDeploy](torch-auto-deploy-auto-deploy.md): This project is under active development and is currently in a prototype stage. The code is experimental, subject to ...
- [Support_Matrix](torch-auto-deploy-support-matrix.md): AutoDeploy streamlines model deployment with an automated workflow designed for efficiency and performance. The workf...
- [Checkpoint Loading](torch-features-checkpoint-loading.md): The PyTorch backend provides a flexible and extensible infrastructure for loading model checkpoints from different so...
- [LoRA (Low-Rank Adaptation)](torch-features-lora.md): LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that enables adapting large language models...
- [Overlap Scheduler](torch-features-overlap-scheduler.md): To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating respon...
- [Quantization](torch-features-quantization.md): The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models in HF model hub,
- [Sampling](torch-features-sampling.md): The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature...
- [KV Cache Manager](torch-kv-cache-manager.md): In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to optimize decoding efficiency, particular...
- [Scheduler](torch-scheduler.md): TensorRT LLM PyTorch backend employs inflight batching, a mechanism where batching and scheduling occur dynamically a...
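As referenced from the "How to Change KV Cache Behavior" entry above, the sketch below shows a minimal LLM API call that passes an explicit KV cache configuration. It is an illustrative starting point only, assuming the `tensorrt_llm.LLM`, `SamplingParams`, and `tensorrt_llm.llmapi.KvCacheConfig` interfaces and a TinyLlama checkpoint name; see the linked pages for the authoritative options and defaults.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main():
    # Cap the KV cache at a fraction of free GPU memory and enable block reuse.
    # The specific fields used here are assumptions based on the KV cache pages above.
    kv_cache_config = KvCacheConfig(
        free_gpu_memory_fraction=0.8,
        enable_block_reuse=True,
    )

    # Model name is illustrative; any supported Hugging Face checkpoint works.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        kv_cache_config=kv_cache_config,
    )

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    for output in llm.generate(["Hello, my name is"], sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```

The same KV cache knobs can typically also be supplied when serving, for example through the configuration file accepted by `trtllm-serve`; the deployment guide and `trtllm-serve` pages above describe the exact mechanism.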