# TensorRT LLM

> Reusable note sections for docs.

## Pages

- [Note_Sections](includes-note-sections.md): ..
- [trtllm-bench](commands-trtllm-bench.md): trtllm-bench
- [trtllm-build](commands-trtllm-build.md): trtllm-build
- [trtllm-eval](commands-trtllm-eval.md): trtllm-eval
- [trtllm-serve](commands-trtllm-serve.md): trtllm-serve
- [trtllm-serve](commands-trtllm-serve-trtllm-serve.md): trtllm-serve
- [Config_Table](deployment-guide-config-table.md): .. start-config-table-note
- [Model Recipes](deployment-guide.md): Model Recipes
- [Dynamo K8s Example](examples-dynamo-k8s-example.md): Dynamo K8s Example
- [Index](examples.md): =======================================================
- [Index](index.md): .. TensorRT LLM documentation master file, created by
- [Index](installation.md): .. _installation:
- [Index](legacy-performance-performance-tuning-guide.md): Performance Tuning Guide
- [Functionals](legacy-python-api-tensorrt-llmfunctional.md): Functionals
- [Layers](legacy-python-api-tensorrt-llmlayers.md): Layers
- [Models](legacy-python-api-tensorrt-llmmodels.md): Models
- [Plugin](legacy-python-api-tensorrt-llmplugin.md): Plugin
- [Quantization](legacy-python-api-tensorrt-llmquantization.md): Quantization
- [Runtime](legacy-python-api-tensorrt-llmruntime.md): Runtime
- [How to get best performance on DeepSeek-R1 in TensorRT LLM](blogs-best-perf-practice-on-deepseek-r1-in-tensorrt-llm.md): NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system wi...
- [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](blogs-falcon180b-h200.md): H200's large capacity & high memory bandwidth, paired with TensorRT LLM's
- [H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token](blogs-h100vsa100.md): :bangbang: :new: *NVIDIA H200 has been announced & is optimized on TensorRT LLM. Learn more about H200, & H100 compar...
- [H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM](blogs-h200launch.md): :loudspeaker: Note: The below data is using TensorRT LLM v0.5. There have been significant improvements in v0.6 & lat...
- [New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](blogs-xqa-kernel.md): XQA kernel provides optimization for [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/abs/2305.132...
- [Speed up inference with SOTA quantization techniques in TRT-LLM](blogs-quantization-in-trt-llm.md): The deployment and inference speed of LLMs are often impeded by limitations in memory capacity, memory bandwidth, and...
- [ADP Balance Strategy](blogs-tech-blog-blog10-adp-balance-strategy.md): By NVIDIA TensorRT LLM team
- [Create an API key at https://ngc.nvidia.com (if you don't have one)](blogs-tech-blog-blog11-gpt-oss-eagle3.md): This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It ...
- [Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly](blogs-tech-blog-blog12-combining-guided-decoding-and-speculative-decoding.md): *By NVIDIA TensorRT LLM Team and the XGrammar Team*
- [Inference Time Compute Implementation in TensorRT LLM](blogs-tech-blog-blog13-inference-time-compute-implementation-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team and UCSD Hao AI Lab
- [Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)](blogs-tech-blog-blog14-scaling-expert-parallelism-in-tensorrt-llm-part3.md): This blog post is a continuation of previous posts:
- [Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs](blogs-tech-blog-blog1-pushing-latency-boundaries-optimizing-deepseek-r1-performa.md): by NVIDIA TensorRT LLM team
- [DeepSeek R1 MTP Implementation and Optimization](blogs-tech-blog-blog2-deepseek-r1-mtp-implementation-and-optimization.md): by NVIDIA TensorRT LLM team
- [Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers](blogs-tech-blog-blog3-optimizing-deepseek-r1-throughput-on-nvidia-blackwell-gpus.md): By NVIDIA TensorRT LLM team
- [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)](blogs-tech-blog-blog4-scaling-expert-parallelism-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team
- [Disaggregated Serving in TensorRT LLM](blogs-tech-blog-blog5-disaggregated-serving-in-tensorrt-llm.md): By NVIDIA TensorRT LLM Team
- [How to launch Llama4 Maverick + Eagle3 TensorRT LLM server](blogs-tech-blog-blog6-llama4-maverick-eagle-guide.md): Artificial Analysis has benchmarked the Llama4 Maverick with Eagle3 enabled TensorRT LLM server running at over [1000...
- [N-Gram Speculative Decoding in TensorRT LLM](blogs-tech-blog-blog7-ngram-performance-analysis-and-auto-enablement.md): N-Gram speculative decoding leverages the natural repetition in many LLM workloads. It splits previously seen text in...
- [Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)](blogs-tech-blog-blog8-scaling-expert-parallelism-in-tensorrt-llm-part2.md): This blog post continues our previous work on [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Impleme...
- [Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM](blogs-tech-blog-blog9-deploying-gpt-oss-on-trtllm.md): In the guide below, we will walk you through how to launch your own
- [Run benchmarking with `trtllm-serve`](commands-trtllm-serve-run-benchmark-with-trtllm-serve.md): TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command.
- [Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-deepseek-r1-on-trtllm.md): This deployment guide provides step-by-step instructions for running the DeepSeek R1 model using TensorRT LLM with FP...
- [Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware](deployment-guide-deployment-guide-for-gpt-oss-on-trtllm.md): This deployment guide provides step-by-step instructions for running the GPT-OSS model using TensorRT-LLM, optimized ...
- [Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell](deployment-guide-deployment-guide-for-kimi-k2-thinking-on-trtllm.md): This is a quickstart guide for running the Kimi K2 Thinking model on TensorRT LLM. It focuses on a working setup with...
- [Deployment Guide for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-llama33-70b-on-trtllm.md): This deployment guide provides step-by-step instructions for running the Llama 3.3-70B Instruct model using TensorRT ...
- [Deployment Guide for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-llama4-scout-on-trtllm.md): This deployment guide provides step-by-step instructions for running the Llama-4-Scout-17B-16E-Instruct model using T...
- [Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-qwen3-next-on-trtllm.md): This is a functional quick-start guide for running the Qwen3-Next model on TensorRT LLM. It focuses on a working setu...
- [Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware](deployment-guide-deployment-guide-for-qwen3-on-trtllm.md): This is a functional quick-start guide for running the Qwen3 model on TensorRT LLM. It focuses on a working setup wit...
- [LLM API Change Guide](developer-guide-api-change.md): This guide explains how to modify and manage APIs in TensorRT LLM, focusing on the high-level LLM API.
- [Continuous Integration Overview](developer-guide-ci-overview.md): This page explains how TensorRT‑LLM's CI is organized and how individual tests map to Jenkins stages. Most stages exe...
- [Using Dev Containers](developer-guide-dev-containers.md): The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
- [Introduction to KV Cache Transmission](developer-guide-kv-transfer.md): This article provides a general overview of the components used for device-to-device transmission of KV cache, which ...
- [Architecture Overview](developer-guide-overview.md): The `LLM` class is a core entry point for TensorRT LLM, providing a simplified `generate()` API for efficient lar...
- [Performance Analysis](developer-guide-perf-analysis.md): (perf-analysis)=
- [TensorRT LLM Benchmarking](developer-guide-perf-benchmarking.md): (perf-benchmarking)=
- [Overview](developer-guide-perf-overview.md): (perf-overview)=
- [LLM Common Customizations](examples-customization.md): TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instanc...
- [How to Change KV Cache Behavior](examples-kvcacheconfig.md): Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine (a minimal sketch follows this list). Consid...
- [How to Change Block Priorities](examples-kvcacheretentionconfig.md): You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a re...
- [Additional Outputs](features-additional-outputs.md): (additional-outputs)=
- [Multi-Head, Multi-Query, and Group-Query Attention](features-attention.md): (attention)=
- [Benchmarking with trtllm-bench](features-auto-deploy-advanced-benchmarking-with-trtllm-bench.md): AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehens...
- [Example Run Script](features-auto-deploy-advanced-example-run.md): To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
- [Expert Configuration of LLM API](features-auto-deploy-advanced-expert-configurations.md): For advanced TensorRT LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use a...
- [Logging Level](features-auto-deploy-advanced-logging.md): Use the following env variable to specify the logging level of our built-in logger, ordered by
- [Construct the LLM high-level interface object with autodeploy as backend](features-auto-deploy-advanced-workflow.md): AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section prov...
- [AutoDeploy (Prototype)](features-auto-deploy-auto-deploy.md): This project is under active development and is currently in a prototype stage. The code is a prototype, subject to c...
- [Support_Matrix](features-auto-deploy-support-matrix.md): AutoDeploy streamlines model deployment with an automated workflow designed for efficiency and performance. The workf...
- [Checkpoint Loading](features-checkpoint-loading.md): The PyTorch backend provides a flexible and extensible infrastructure for loading model checkpoints from different fo...
- [Disaggregated Serving](features-disagg-serving.md): - Motivation
- [Feature Combination Matrix](features-feature-combination-matrix.md): | Feature | Overlap Scheduler | CUDA Graph | Attention Data Parallelism | Disaggregated Serving | ...
- [Guided Decoding](features-guided-decoding.md): Guided decoding (or interchangeably constrained decoding, structured generation) guarantees that the LLM outputs are ...
- [Helix Parallelism](features-helix.md): Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional at...
- [KV Cache Connector](features-kv-cache-connector.md): The KV Cache Connector is a flexible interface in TensorRT-LLM that enables remote or external access to the Key-Valu...
- [KV Cache System](features-kvcache.md): The KV cache stores previously computed key-value pairs for reuse during generation in order to avoid redundant calcu...
- [Long Sequences](features-long-sequence.md): In many real-world scenarios, such as long document summarization or multi-turn conversations, LLMs are required to ...
- [LoRA (Low-Rank Adaptation)](features-lora.md): LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that enables adapting large language models...
- [Multimodal Support in TensorRT LLM](features-multi-modality.md): TensorRT LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text.
- [Overlap Scheduler](features-overlap-scheduler.md): To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating respon...
- [Paged Attention, IFB, and Request Scheduling](features-paged-attention-ifb-scheduler.md): TensorRT LLM supports in-flight batching of requests (also known as continuous
- [Parallelism in TensorRT LLM](features-parallel-strategy.md): Parallelism across multiple GPUs becomes necessary when either
- [Quantization](features-quantization.md): Quantization is a technique used to reduce memory footprint and computational cost by converting the model's weights ...
- [Ray Orchestrator (Prototype)](features-ray-orchestrator.md): This project is under active development and currently in a prototype stage. The current focus is on core functionali...
- [Sampling](features-sampling.md): The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature...
- [Sparse Attention](features-sparse-attention.md): - Background and Motivation
- [Speculative Decoding](features-speculative-decoding.md): There are two flavors of speculative decoding currently supported in the PyTorch backend:
- [Torch Compile & Piecewise CUDA Graph](features-torch-compile-and-piecewise-cuda-graph.md): In this guide, we show how to enable torch.compile and Piecewise CUDA Graph in TensorRT LLM. TensorRT LLM uses torch....
- [Building from Source Code on Linux](installation-build-from-source-linux.md): (build-from-source-linux)=
- [Pre-built release container images on NGC](installation-containers.md): (containers)=
- [Installing on Linux via `pip`](installation-linux.md): (linux)=
- [Disaggregated-Service (Prototype)](legacy-advanced-disaggregated-service.md): (disaggregated-service)=
- [Executor API](legacy-advanced-executor.md): (executor)=
- [Expert Parallelism in TensorRT-LLM](legacy-advanced-expert-parallelism.md): (expert-parallelism)=
- [Multi-Head, Multi-Query, and Group-Query Attention](legacy-advanced-gpt-attention.md): (gpt-attention)=
- [C++ GPT Runtime](legacy-advanced-gpt-runtime.md): (gpt-runtime)=
- [Graph Rewriting Module](legacy-advanced-graph-rewriting.md): (graph-rewriting)=
- [KV Cache Management: Pools, Blocks, and Events](legacy-advanced-kv-cache-management.md): (kv-cache-management)=
- [KV cache reuse](legacy-advanced-kv-cache-reuse.md): (kv-cache-reuse)=
- [loraConfig](legacy-advanced-lora.md): (lora)=
- [Low-Precision-AllReduce](legacy-advanced-lowprecision-pcie-allreduce.md): Note:
- [Open Sourced Cutlass Kernels](legacy-advanced-open-sourced-cutlass-kernels.md): We have recently open-sourced a set of Cutlass kernels that were previously known as "internal_cutlass_kernels". Due ...
- [Speculative Sampling](legacy-advanced-speculative-decoding.md): - About Speculative Sampling
- [Convert model as normal. Assume hugging face model is in llama-7b-hf/](legacy-advanced-weight-streaming.md): (weight-streaming)=
- [Adding a Model](legacy-architecture-add-model.md): (add-model)=
- [TensorRT LLM Checkpoint](legacy-architecture-checkpoint.md): The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those vers...
- [Model Definition](legacy-architecture-core-concepts.md): (core-concepts)=
- [TensorRT-LLM Model Weights Loader](legacy-architecture-model-weights-loader.md): The weights loader is designed for easily converting and loading external weight checkpoints into TensorRT-LLM models.
- [TensorRT-LLM Build Workflow](legacy-architecture-workflow.md): The build workflow contains two major steps.
- [Build the TensorRT LLM Docker Image](legacy-dev-on-cloud-build-image-to-dockerhub.md): (build-image-to-dockerhub)=
- [Develop TensorRT LLM on Runpod](legacy-dev-on-cloud-dev-on-runpod.md): (dev-on-runpod)=
- [Key Features](legacy-key-features.md): This document lists key features supported in TensorRT-LLM.
- [Performance Analysis](legacy-performance-perf-analysis.md): (perf-analysis)=
- [TensorRT-LLM Benchmarking](legacy-performance-perf-benchmarking.md): (perf-benchmarking)=
- [Benchmarking Default Performance](legacy-performance-performance-tuning-guide-benchmarking-default-performance.md): (benchmarking-default-performance)=
- [Deciding Model Sharding Strategy](legacy-performance-performance-tuning-guide-deciding-model-sharding-strategy.md): (deciding-model-sharding-strategy)=
- [FP8 Quantization](legacy-performance-performance-tuning-guide-fp8-quantization.md): (fp8-quantization)=
- [Introduction](legacy-performance-performance-tuning-guide-introduction.md): While defaults are expected to provide solid performance, TensorRT-LLM has several configurable options that can impr...
- [Tuning Max Batch Size and Max Num Tokens](legacy-performance-performance-tuning-guide-tuning-max-batch-size-and-max-num-to.md): (tuning-max-batch-size-and-max-num-tokens)=
- [Useful Build-Time Flags](legacy-performance-performance-tuning-guide-useful-build-time-flags.md): (useful-build-time-flags)=
- [Useful Runtime Options](legacy-performance-performance-tuning-guide-useful-runtime-flags.md): (useful-runtime-flags)=
- [Memory Usage of TensorRT-LLM](legacy-reference-memory.md): (memory)=
- [Multimodal Feature Support Matrix (PyTorch Backend)](legacy-reference-multimodal-feature-support-matrix.md): | Model | CUDA Graph | Encoder IFB | KV Cache Reuse | Chunked Prefill |
- [Numerical Precision](legacy-reference-precision.md): (precision)=
- [Support Matrix](legacy-reference-support-matrix.md): TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide...
- [Troubleshooting](legacy-reference-troubleshooting.md): (troubleshooting)=
- [LLM API with TensorRT Engine](legacy-tensorrt-quickstart.md): A simple inference example with TinyLlama using the LLM API:
- [PyTorch Backend](legacy-torch.md): Note:
- [LLM API Introduction](llm-api.md): The LLM API is a high-level Python API designed to streamline LLM inference workflows.
- [Adding a New Model](models-adding-new-model.md): 1. Introduction
- [Supported Models](models-supported-models.md): (support-matrix)=
- [Overview](overview.md): (product-overview)=
- [Quick Start Guide](quick-start-guide.md): (quick-start-guide)=
- [Release Notes](release-notes.md): (release-notes)=
- [Adding a New Model in PyTorch Backend](torch-adding-new-model.md): 1. Introduction
- [Architecture Overview](torch-arch-overview.md): TensorRT LLM is a toolkit designed to create optimized solutions for Large Language Model (LLM) inference.
- [Attention](torch-attention.md): (attention)=
- [Benchmarking with trtllm-bench](torch-auto-deploy-advanced-benchmarking-with-trtllm-bench.md): AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehens...
- [Example Run Script](torch-auto-deploy-advanced-example-run.md): To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
- [Expert Configuration of LLM API](torch-auto-deploy-advanced-expert-configurations.md): For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use a...
- [Logging Level](torch-auto-deploy-advanced-logging.md): Use the following env variable to specify the logging level of our built-in logger, ordered by
- [Serving with trtllm-serve](torch-auto-deploy-advanced-serving-with-trtllm-serve.md): AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models ove...
- [Construct the LLM high-level interface object with autodeploy as backend](torch-auto-deploy-advanced-workflow.md): AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section prov...
- [AutoDeploy](torch-auto-deploy-auto-deploy.md): This project is under active development and is currently in a prototype stage. The code is experimental, subject to ...
- [Support_Matrix](torch-auto-deploy-support-matrix.md): AutoDeploy streamlines model deployment with an automated workflow designed for efficiency and performance. The workf...
- [Checkpoint Loading](torch-features-checkpoint-loading.md): The PyTorch backend provides a flexible and extensible infrastructure for loading model checkpoints from different so...
- [LoRA (Low-Rank Adaptation)](torch-features-lora.md): LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that enables adapting large language models...
- [Overlap Scheduler](torch-features-overlap-scheduler.md): To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating respon...
- [Quantization](torch-features-quantization.md): The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models in HF model hub,
- [Sampling](torch-features-sampling.md): The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature...
- [KV Cache Manager](torch-kv-cache-manager.md): In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to optimize decoding efficiency, particular...
- [Scheduler](torch-scheduler.md): TensorRT LLM PyTorch backend employs inflight batching, a mechanism where batching and scheduling occur dynamically a...
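As referenced from the "How to Change KV Cache Behavior" entry above, the sketch below shows a minimal LLM API call that passes an explicit KV cache configuration. It is an illustrative starting point only, assuming the `tensorrt_llm.LLM`, `SamplingParams`, and `tensorrt_llm.llmapi.KvCacheConfig` interfaces and a TinyLlama checkpoint name; see the linked pages for the authoritative options and defaults.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main():
    # Cap the KV cache at a fraction of free GPU memory and enable block reuse.
    # The specific fields used here are assumptions based on the KV cache pages above.
    kv_cache_config = KvCacheConfig(
        free_gpu_memory_fraction=0.8,
        enable_block_reuse=True,
    )

    # Model name is illustrative; any supported Hugging Face checkpoint works.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        kv_cache_config=kv_cache_config,
    )

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    for output in llm.generate(["Hello, my name is"], sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```

The same KV cache knobs can typically also be supplied when serving, for example through the configuration file accepted by `trtllm-serve`; the deployment guide and `trtllm-serve` pages above describe the exact mechanism.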