.. Reusable note sections for docs.
   Include specific notes using:

   .. include:: /note_sections.rst
      :start-after: .. start-note-
      :end-before: .. end-note-

.. start-note-config-flag-alias

.. note::
   **Non-breaking**: ``--config`` is the preferred flag for passing a :ref:`YAML configuration file `. Existing workflows using ``--extra_llm_api_options`` continue to work; it is an equivalent alias.

.. end-note-config-flag-alias

.. start-note-traffic-patterns

.. note::
   **Traffic Patterns**: The ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration represent the **maximum supported values** for that config. Requests exceeding these limits may result in errors.

   To handle requests with input sequences **longer than the configured ISL**, add the following to your config file:

   .. code-block:: yaml

      enable_chunked_prefill: true

   This enables chunked prefill, which processes long input sequences in chunks rather than requiring them to fit within a single prefill operation. Note that enabling chunked prefill does **not** guarantee optimal performance; these configs are tuned for the specified ISL/OSL.

.. end-note-traffic-patterns

.. start-note-quick-start-isl-osl

.. note::
   The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the :ref:`Preconfigured Recipes` section below, which covers a larger set of traffic patterns and performance profiles.

.. end-note-quick-start-isl-osl

---

trtllm-bench
============

trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It provides three main subcommands for different benchmarking scenarios:

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

Syntax
------

.. click:: tensorrt_llm.commands.bench:main
   :prog: trtllm-bench
   :nested: full
   :commands: throughput, latency, build

Dataset preparation
-------------------

prepare_dataset.py
^^^^^^^^^^^^^^^^^^

trtllm-bench is designed to work with the `prepare_dataset.py `_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

**Dataset Types:**

- Real datasets from various sources
- Synthetic datasets with normal or uniform token distributions
- LoRA task-specific datasets

**Key Features:**

- Tokenizer integration for proper text preprocessing
- Configurable random seeds for reproducible results
- Support for LoRA adapters and task IDs
- Output in JSON format compatible with trtllm-bench

.. important::
   The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.

**Usage:**

prepare_dataset
"""""""""""""""

.. code-block:: bash

   python prepare_dataset.py [OPTIONS]

**Options**

----

..
list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--tokenizer`` - Tokenizer directory or HuggingFace model name (required) * - ``--output`` - Output JSON filename (default: preprocessed_dataset.json) * - ``--stdout`` - Print output to stdout with JSON dataset entry on each line (**required for trtllm-bench**) * - ``--random-seed`` - Random seed for token generation (default: 420) * - ``--task-id`` - LoRA task ID (default: -1) * - ``--rand-task-id`` - Random LoRA task range (two integers) * - ``--lora-dir`` - Directory containing LoRA adapters * - ``--log-level`` - Logging level: info or debug (default: info) dataset """"""" Process real datasets from various sources. .. code-block:: bash python prepare_dataset.py dataset [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--input`` - Input dataset file or directory (required) * - ``--max-input-length`` - Maximum input sequence length (default: 2048) * - ``--max-output-length`` - Maximum output sequence length (default: 512) * - ``--num-samples`` - Number of samples to process (default: all) * - ``--format`` - Input format: json, jsonl, csv, or txt (default: auto-detect) token_norm_dist """"""""""""""" Generate synthetic datasets with normal token distribution. .. code-block:: bash python prepare_dataset.py token_norm_dist [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--num-requests`` - Number of requests to be generated (required) * - ``--input-mean`` - Normal distribution mean for input tokens (required) * - ``--input-stdev`` - Normal distribution standard deviation for input tokens (required) * - ``--output-mean`` - Normal distribution mean for output tokens (required) * - ``--output-stdev`` - Normal distribution standard deviation for output tokens (required) token_unif_dist """"""""""""""" Generate synthetic datasets with uniform token distribution .. code-block:: bash python prepare_dataset.py token_unif_dist [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--num-requests`` - Number of requests to be generated (required) * - ``--input-min`` - Uniform distribution minimum for input tokens (required) * - ``--input-max`` - Uniform distribution maximum for input tokens (required) * - ``--output-min`` - Uniform distribution minimum for output tokens (required) * - ``--output-max`` - Uniform distribution maximum for output tokens (required) --- trtllm-build =========================== .. argparse:: :module: tensorrt_llm.commands.build :func: parse_arguments :prog: trtllm-build --- trtllm-eval =========== About ----- The ``trtllm-eval`` command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the `accuracy test suite `_ of TensorRT LLM. ``trtllm-eval`` is built on the offline API -- LLM API. Compared to the online ``trtllm-serve``, the offline API provides clearer error messages and simplifies the debugging workflow. The following tasks are currently supported: .. list-table:: :header-rows: 1 :widths: 20 25 15 15 15 * - Dataset - Task - Metric - Default ISL - Default OSL * - CNN Dailymail - summarization - rouge - 924 - 100 * - MMLU - QA; multiple choice - accuracy - 4,094 - 2 * - GSM8K - QA; regex matching - accuracy - 4,096 - 256 * - GPQA - QA; multiple choice - accuracy - 32,768 - 4,096 * - JSON mode eval - structured generation - accuracy - 1,024 - 512 .. 
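As an end-to-end reference for the ``trtllm-bench`` dataset-preparation workflow described above, the following sketch generates a synthetic dataset and then benchmarks against it. The model name, token statistics, and file name are illustrative, and the ``trtllm-bench`` flags shown are assumptions; confirm them with ``trtllm-bench throughput --help``.

.. code-block:: bash

   # Generate 256 synthetic requests of ~1024 input / ~1024 output tokens.
   # --stdout is required so each dataset entry is emitted as one JSON line.
   python prepare_dataset.py \
       --tokenizer meta-llama/Llama-3.1-8B-Instruct \
       --stdout \
       token_norm_dist \
       --num-requests 256 \
       --input-mean 1024 --input-stdev 0 \
       --output-mean 1024 --output-stdev 0 \
       > synthetic_1k1k.jsonl

   # Run the throughput benchmark against the generated dataset
   # (flag names assumed; see `trtllm-bench throughput --help`).
   trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
       throughput --dataset synthetic_1k1k.jsonl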
.. note::
   ``trtllm-eval`` originates from the TensorRT LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, ``trtllm-eval`` is intended solely for development and is not recommended for production use.

Usage and Examples
------------------

Some evaluation tasks (e.g., GSM8K and GPQA) depend on the ``lm_eval`` package. To run these tasks, install ``lm_eval`` with:

.. code-block:: bash

   pip install -r requirements-dev.txt

Alternatively, you can install the ``lm_eval`` version specified in ``requirements-dev.txt``.

Here are some examples:

.. code-block:: bash

   # Evaluate Llama-3.1-8B-Instruct on MMLU
   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu

   # Evaluate Llama-3.1-8B-Instruct on GSM8K
   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k

   # Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
   trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond

The ``--model`` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, ``trtllm-eval`` runs the model with the PyTorch backend; you can pass ``--backend tensorrt`` to switch to the TensorRT backend. Alternatively, the ``--model`` argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the ``--tokenizer`` argument. For more details, see ``trtllm-eval --help``.

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

Syntax
------

.. click:: tensorrt_llm.commands.eval:main
   :prog: trtllm-eval
   :nested: full

---

trtllm-serve
============

.. toctree::
   :maxdepth: 1

   trtllm-serve
   run-benchmark-with-trtllm-serve

---

trtllm-serve
============

About
-----

The ``trtllm-serve`` command starts an OpenAI-compatible server that supports the following endpoints:

- ``/v1/models``
- ``/v1/completions``
- ``/v1/chat/completions``

For information about the inference endpoints, refer to the `OpenAI API Reference `__.

The server also supports the following endpoints:

- ``/health``
- ``/metrics``
- ``/version``

The ``metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.

Starting a Server
-----------------

The following abbreviated command syntax shows the commonly used arguments to start a server:

.. code-block:: bash

   trtllm-serve  [--tp_size  --pp_size  --ep_size  --host  --port ]

For the full syntax and argument descriptions, refer to :ref:`syntax`.
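For example, a minimal invocation that serves the TinyLlama model used throughout the following sections might look like this (a sketch; the host and port flags are optional and shown only for illustration):

.. code-block:: bash

   # Serve TinyLlama on a single GPU and expose it on port 8000.
   trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --host 0.0.0.0 --port 8000

   # From another shell, confirm the server is up via the health endpoint.
   curl http://localhost:8000/health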
Inference Endpoints
-------------------

After you start the server, you can send inference requests through the Completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 `_ for the examples in the following sections.

Chat API
~~~~~~~~

You can query the Chat API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_chat_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_chat_client.sh
   :language: bash
   :linenos:

Completions API
~~~~~~~~~~~~~~~

You can query the Completions API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_completion_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_completion_client.sh
   :language: bash
   :linenos:

Responses API
~~~~~~~~~~~~~

You can query the Responses API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_responses_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_responses_client.sh
   :language: bash
   :linenos:

More OpenAI-compatible examples can be found in the `compatibility examples `_ directory.

Multimodal Serving
~~~~~~~~~~~~~~~~~~

For multimodal models, you need to create a configuration file and start the server with additional options because of the following limitations:

* TRT-LLM multimodal is currently not compatible with ``kv_cache_reuse``
* Multimodal models require ``chat_template``, so only the Chat API is supported

To set up multimodal models:

First, create a configuration file: .. code-block:: bash cat >./config.yml<`__ for implementation details.

**Video**

* Using "video_url":

.. code-block:: json

   {"role": "user", "content": [
       {"type": "text", "text": "What's in this video?"},
       {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
   ]}

**Audio**

* Using "audio_url":

.. code-block:: json

   {"role": "user", "content": [
       {"type": "text", "text": "What's in this audio?"},
       {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
   ]}

Multi-node Serving with Slurm
-----------------------------

You can deploy the `DeepSeek-V3 `_ model across two nodes with Slurm and ``trtllm-serve``:

.. code-block:: bash

   echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > config.yml

   srun -N 2 -w [NODES] \
       --output=benchmark_2node.log \
       --ntasks 16 --ntasks-per-node=8 \
       --mpi=pmix --gres=gpu:8 \
       --container-image= \
       --container-mounts=/workspace:/workspace \
       --container-workdir /workspace \
       bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --config ./config.yml"

See `the source code `_ of ``trtllm-llmapi-launch`` for more details.

Metrics Endpoint
----------------

.. note::
   The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

   Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details. For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:

.. code-block:: yaml

   # extra_llm_config.yaml
   enable_iter_perf_stats: true

Start the server and specify the ``--config`` argument with the path to the YAML file:

.. code-block:: bash

   trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --config config.yaml

After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint. Because the statistics are stored in an internal queue and removed once retrieved, poll the endpoint shortly after each request and store the results if needed.

.. code-block:: bash

   curl -X GET http://localhost:8000/metrics
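For continuous monitoring, a small client can poll ``/metrics`` right after each request and append the returned records to a local file before they are dropped from the queue. The sketch below is illustrative only; the port, model name, and request body are assumptions carried over from the examples above.

.. code-block:: bash

   # Send one inference request ...
   curl -s http://localhost:8000/v1/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello, my name is", "max_tokens": 16}'

   # ... then immediately drain the metrics queue. Retrieved entries are removed
   # server-side, so accumulate them locally if you need the full history.
   curl -s -X GET http://localhost:8000/metrics >> metrics_log.jsonl

A single poll returns a JSON array of per-iteration statistics, as in the example output below.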
Example output:

.. code-block:: json

   [
     {
       "gpuMemUsage": 76665782272,
       "iter": 154,
       "iterLatencyMS": 7.00688362121582,
       "kvCacheStats": {
         "allocNewBlocks": 3126,
         "allocTotalBlocks": 3126,
         "cacheHitRate": 0.00128,
         "freeNumBlocks": 101253,
         "maxNumBlocks": 101256,
         "missedBlocks": 3121,
         "reusedBlocks": 4,
         "tokensPerBlock": 32,
         "usedNumBlocks": 3
       },
       "numActiveRequests": 1
       ...
     }
   ]

.. _configuring-with-yaml-files:

Configuring with YAML Files
---------------------------

You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. The arguments in the file override the corresponding command line arguments.

.. include:: ../../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

The YAML file configures `tensorrt_llm.llmapi.LlmArgs `_. Because this class has multiple levels of hierarchy, top-level arguments such as ``max_batch_size`` are set directly at the root of the file:

.. code-block:: yaml

   max_batch_size: 8

Nested arguments such as ``moe_config.backend`` are set under their parent key:

.. code-block:: yaml

   moe_config:
     backend: CUTLASS

Syntax
------

.. click:: tensorrt_llm.commands.serve:main
   :prog: trtllm-serve
   :nested: full

Besides the examples above, ``trtllm-serve`` is also used as an entry point for performance benchmarking. Refer to `Performance Benchmarking with trtllm-serve `_ for more details.

---

.. start-config-table-note

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-traffic-patterns
   :end-before: .. end-note-traffic-patterns

.. end-config-table-note

.. start-deepseek-ai/DeepSeek-R1-0528

.. _deepseek-ai/DeepSeek-R1-0528:

`DeepSeek-R1 `_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

..
list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc64.yaml`` * - 
8xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc64.yaml`` .. end-deepseek-ai/DeepSeek-R1-0528 .. start-nvidia/DeepSeek-R1-0528-FP4-v2 .. _nvidia/DeepSeek-R1-0528-FP4-v2: `DeepSeek-R1 (NVFP4) `_ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - 4xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc16.yaml`` * - 4xB200_NVL - Balanced - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc32.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc64.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 128 - `1k1k_tp4_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc128.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 1024 - 256 - `1k1k_tp4_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc256.yaml`` * - 4xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config 
${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc16.yaml`` * - 4xB200_NVL - Balanced - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc32.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc64.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 128 - `8k1k_tp4_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc128.yaml`` * - 4xB200_NVL - Max Throughput - 8192 / 1024 - 256 - `8k1k_tp4_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc256.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 128 - `1k1k_tp8_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc128.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 256 - `1k1k_tp8_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc256.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config 
${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc64.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 128 - `8k1k_tp8_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc128.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 256 - `8k1k_tp8_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc256.yaml`` .. end-nvidia/DeepSeek-R1-0528-FP4-v2 .. start-openai/gpt-oss-120b .. _openai/gpt-oss-120b: `gpt-oss-120b `_ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - B200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc64.yaml`` * - B200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc64.yaml`` * - B200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc64.yaml`` * - 2xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc64.yaml`` * - 2xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc64.yaml`` * - 2xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc64.yaml`` * - 4xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc64.yaml`` * - 4xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc64.yaml`` * - 4xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc64.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc64.yaml`` * - H200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc64.yaml`` * - H200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc64.yaml`` * - H200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc64.yaml`` * - 2xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc64.yaml`` * - 2xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc64.yaml`` * - 2xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc64.yaml`` * - 4xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc64.yaml`` * - 4xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc64.yaml`` * - 4xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc64.yaml`` .. end-openai/gpt-oss-120b --- Model Recipes ================ Quick Start for Popular Models ------------------------------- The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more. We maintain LLM API configuration files for these models containing recommended performance settings in two locations: * **Curated Examples**: `examples/configs/curated `_ - Hand-picked configurations for common scenarios. * **Comprehensive Database**: `examples/configs/database `_ - A more comprehensive set of known-good configurations for various GPUs and traffic patterns. The TensorRT LLM Docker container makes these config files available at ``/app/tensorrt_llm/examples/configs/curated`` and ``/app/tensorrt_llm/examples/configs/database`` respectively. You can reference them as needed: .. code-block:: bash export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment .. include:: ../_includes/note_sections.rst :start-after: .. start-note-quick-start-isl-osl :end-before: .. end-note-quick-start-isl-osl This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below. .. 
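As a concrete example, the gpt-oss-120b entry in the table below can be exercised end to end as shown here. The serve command matches the table; the request itself is only an illustrative smoke test and assumes the server's default port of 8000.

.. code-block:: bash

   export TRTLLM_DIR="/app/tensorrt_llm"

   # Launch the server with the curated max-throughput config from the table below.
   trtllm-serve openai/gpt-oss-120b \
       --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml

   # Once the server is up, send a quick chat request from another shell.
   curl -s http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'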
list-table:: :header-rows: 1 :widths: 20 15 15 20 30 * - Model Name - GPU - Inference Scenario - Config - Command * - `DeepSeek-R1 `_ - H100, H200 - Max Throughput - `deepseek-r1-throughput.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml`` * - `DeepSeek-R1 `_ - B200, GB200 - Max Throughput - `deepseek-r1-deepgemm.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml`` * - `DeepSeek-R1 (NVFP4) `_ - B200, GB200 - Max Throughput - `deepseek-r1-throughput.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-FP4 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml`` * - `DeepSeek-R1 (NVFP4) `_ - B200, GB200 - Min Latency - `deepseek-r1-latency.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml`` * - `gpt-oss-120b `_ - Any - Max Throughput - `gpt-oss-120b-throughput.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml`` * - `gpt-oss-120b `_ - Any - Min Latency - `gpt-oss-120b-latency.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml`` * - `Qwen3-Next-80B-A3B-Thinking `_ - Any - Max Throughput - `qwen3-next.yaml `_ - ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --config ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml`` * - Qwen3 family (e.g. `Qwen3-30B-A3B `_) - Any - Max Throughput - `qwen3.yaml `_ - ``trtllm-serve Qwen/Qwen3-30B-A3B --config ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml`` (swap to another Qwen3 model name as needed) * - `Llama-3.3-70B (FP8) `_ - Any - Max Throughput - `llama-3.3-70b.yaml `_ - ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml`` * - `Llama 4 Scout (FP8) `_ - Any - Max Throughput - `llama-4-scout.yaml `_ - ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml`` Model-Specific Deployment Guides --------------------------------- The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM. .. toctree:: :maxdepth: 1 :name: Deployment Guides deployment-guide-for-deepseek-r1-on-trtllm.md deployment-guide-for-llama3.3-70b-on-trtllm.md deployment-guide-for-llama4-scout-on-trtllm.md deployment-guide-for-gpt-oss-on-trtllm.md deployment-guide-for-qwen3-on-trtllm.md deployment-guide-for-qwen3-next-on-trtllm.md deployment-guide-for-kimi-k2-thinking-on-trtllm.md Preconfigured Recipes --------------------- .. _recipe-selector: Recipe selector ^^^^^^^^^^^^^^^ .. trtllm_config_selector:: .. include:: ../_includes/note_sections.rst :start-after: .. start-note-traffic-patterns :end-before: .. end-note-traffic-patterns .. _recipe-database: Recipe database ^^^^^^^^^^^^^^^ The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings. .. include:: config_table.rst :start-after: .. end-config-table-note --- Dynamo K8s Example ================================= This example demonstrates how to deploy TensorRT-LLM on a Kubernetes cluster using Dynamo Cloud. 
Dynamo provides an operator-based approach to manage the lifecycle of model deployments through Custom Resource Definitions (CRDs). Please see `Dynamo Kubernetes Quick Start Guide `_ for more details. --- ======================================================= LLM Examples Introduction ======================================================= Here is a simple example to show how to use the LLM with TinyLlama. .. literalinclude:: ../../../examples/llm-api/quickstart_example.py :language: python :linenos: The LLM API can be used for both offline or online usage. See more examples of the LLM API here: .. toctree:: :maxdepth: 1 :caption: LLM API Examples %EXAMPLE_DOCS% For more details on how to fully utilize this API, check out: * `Common customizations `_ * `LLM API Reference <../llm-api/index.html>`_ --- .. TensorRT LLM documentation master file, created by sphinx-quickstart on Wed Sep 20 08:35:21 2023. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to TensorRT LLM's Documentation! ======================================== .. toctree:: :maxdepth: 2 :caption: Getting Started :name: Getting Started overview.md quick-start-guide.md installation/index.rst .. toctree:: :maxdepth: 2 :caption: Deployment Guide :name: Deployment Guide examples/llm_api_examples.rst examples/trtllm_serve_examples examples/dynamo_k8s_example.rst deployment-guide/index.rst .. toctree:: :maxdepth: 2 :caption: Models :name: Models models/supported-models.md models/adding-new-model.md .. toctree:: :maxdepth: 2 :caption: CLI Reference :name: CLI Reference commands/trtllm-bench commands/trtllm-eval commands/trtllm-serve/index .. toctree:: :maxdepth: 2 :caption: API Reference llm-api/index.md llm-api/reference.rst .. toctree:: :maxdepth: 2 :caption: Features features/feature-combination-matrix.md features/attention.md features/disagg-serving.md features/kvcache.md features/long-sequence.md features/lora.md features/multi-modality.md features/overlap-scheduler.md features/paged-attention-ifb-scheduler.md features/parallel-strategy.md features/quantization.md features/sampling.md features/additional-outputs.md features/guided-decoding.md features/speculative-decoding.md features/checkpoint-loading.md features/auto_deploy/auto-deploy.md features/ray-orchestrator.md features/torch_compile_and_piecewise_cuda_graph.md features/helix.md features/kv-cache-connector.md .. toctree:: :maxdepth: 2 :caption: Developer Guide developer-guide/overview.md developer-guide/perf-analysis.md developer-guide/perf-benchmarking.md developer-guide/ci-overview.md developer-guide/dev-containers.md developer-guide/api-change.md developer-guide/kv-transfer.md .. toctree:: :maxdepth: 2 :caption: Blogs :glob: blogs/tech_blog/* blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md blogs/H200launch.md blogs/XQA-kernel.md blogs/H100vsA100.md .. toctree:: :maxdepth: 2 :caption: Quick Links Releases Github Code Roadmap .. toctree:: :maxdepth: 2 :caption: Use TensorRT Engine :hidden: legacy/tensorrt_quickstart.md Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` --- .. _installation: Installation ============ There are multiple ways to install and run TensorRT LLM. For most users, the options below should be ordered from simple to complex. The approaches are equivalent in terms of the supported features. Note: **This project will download and install additional third-party open source software projects. 
Review the license terms of these open source projects before use.** 1. :ref:`containers` 2. Pre-built release wheels on `PyPI `_ (see :ref:`linux`) 3. :ref:`build-from-source-linux` .. toctree:: :maxdepth: 1 :caption: Links :hidden: containers linux build-from-source-linux --- Performance Tuning Guide ======================== .. include:: introduction.md :parser: myst_parser.sphinx_ .. toctree:: :maxdepth: 1 benchmarking-default-performance useful-build-time-flags tuning-max-batch-size-and-max-num-tokens deciding-model-sharding-strategy fp8-quantization useful-runtime-flags --- Functionals =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.functional :members: :undoc-members: :show-inheritance: --- Layers =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm Activation ------------ .. automodule:: tensorrt_llm.layers.activation :members: :undoc-members: :show-inheritance: Attention ------------ .. automodule:: tensorrt_llm.layers.attention :members: :undoc-members: :show-inheritance: Cast ------------ .. automodule:: tensorrt_llm.layers.cast :members: :undoc-members: :show-inheritance: Conv ------------ .. automodule:: tensorrt_llm.layers.conv :members: :undoc-members: :show-inheritance: Embedding ------------ .. automodule:: tensorrt_llm.layers.embedding :members: :undoc-members: :show-inheritance: Linear ------------ .. automodule:: tensorrt_llm.layers.linear :members: :undoc-members: :show-inheritance: MLP ------------ .. automodule:: tensorrt_llm.layers.mlp :members: :undoc-members: :show-inheritance: Normalization --------------- .. automodule:: tensorrt_llm.layers.normalization :members: :undoc-members: :show-inheritance: Pooling ------------ .. automodule:: tensorrt_llm.layers.pooling :members: :undoc-members: :show-inheritance: --- Models =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.models :members: :undoc-members: :show-inheritance: --- Plugin =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.plugin :members: :show-inheritance: --- Quantization =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.quantization :members: :show-inheritance: --- Runtime =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.runtime :members: :undoc-members: :show-inheritance: --- # How to get best performance on DeepSeek-R1 in TensorRT LLM NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. [NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) In this blog, we share the configurations and procedures for reproducing these numbers on both B200 and H200 with the PyTorch workflow.
## Table of Contents

- [How to get best performance on DeepSeek-R1 in TensorRT LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
- [Table of Contents](#table-of-contents)
- [Prerequisites: Install TensorRT LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
- [1. Download TensorRT LLM](#1-download-tensorrt-llm)
- [2. Download the DeepSeek R1 models](#2-download-the-deepseek-r1-models)
- [3. Build and run TensorRT LLM container](#3-build-and-run-tensorrt-llm-container)
- [4. Compile and Install TensorRT LLM](#4-compile-and-install-tensorrt-llm)
- [5. Optional: Tune GPU clocks](#5-optional-tune-gpu-clocks)
- [6. Dataset preparation](#6-dataset-preparation)
- [Reproducing steps](#reproducing-steps)
- [B200 min-latency](#b200-min-latency)
- [Expected Results](#expected-results)
- [B200 max-throughput for R1-0528 with FP8 KV cache](#b200-max-throughput-for-r1-0528-with-fp8-kv-cache)
- [Benchmark](#benchmark)
- [Expected Result Format](#expected-result-format)
- [B200 max-throughput for R1 with FP16 KV cache](#b200-max-throughput-for-r1-with-fp16-kv-cache)
- [Benchmark](#benchmark-1)
- [Expected Result Format](#expected-result-format-1)
- [H200 min-latency](#h200-min-latency)
- [Expected Result Format](#expected-result-format-2)
- [H200 max-throughput](#h200-max-throughput)
- [Expected Result Format](#expected-result-format-3)
- [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
- [WIP: Enable more features by default](#wip-enable-more-features-by-default)
- [MLA chunked context](#mla-chunked-context)
- [Out of memory issues](#out-of-memory-issues)

## Prerequisites: Install TensorRT LLM and download models

This section can be skipped if you already have TensorRT LLM installed and have already downloaded the DeepSeek R1 model checkpoint.

#### 1. Download TensorRT LLM

**You can find more comprehensive instructions for installing TensorRT LLM in the [TensorRT LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html); refer to that guide if you run into any issues with the steps here.**

``` bash
# Prerequisites
apt-get update && apt-get -y install git git-lfs
git lfs install

# Replace with your actual path
YOUR_WORK_PATH=

# Clone the TensorRT LLM repository
cd $YOUR_WORK_PATH
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
```

**Note**: Replace `<*_PATH>` with your actual path.

#### 2. Download the DeepSeek R1 models

For NVIDIA Blackwell GPUs, it's recommended to use the [FP4 quantized version of DeepSeek R1](https://huggingface.co/nvidia/DeepSeek-R1-FP4) to get the best performance. For NVIDIA Hopper GPUs, it's recommended to use the FP8 version of the DeepSeek R1 model.

```bash
# Replace with your actual path
YOUR_MODEL_PATH=
cd $YOUR_MODEL_PATH

## Download NVFP4 model for Blackwell GPUs
git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
## Or the 0528 version
git clone https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2

## Download FP8 model for Hopper GPUs
## FP8 model also works for Blackwell, but FP4 has the best performance on Blackwell.
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1
```
#### 3. Build and run TensorRT LLM container

``` bash
cd TensorRT-LLM
make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH"
```

Here we set the `LOCAL_USER=1` argument to set up the local user instead of the root account inside the container; you can remove it if running as root inside the container is fine.

#### 4. Compile and Install TensorRT LLM

Here we compile the source inside the container:

``` bash
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "90-real;100-real" --python_bindings --clean
```

You can set `cuda_architectures` to "100-real" to target Blackwell only, or to "90-real" to target Hopper only, to save some build time.

Install and set environment variables:

```bash
pip install --user build/tensorrt_llm*.whl
export PATH=${HOME}/.local/bin:${PATH}
export PYTHONPATH=`pwd`
```

#### 5. Optional: Tune GPU clocks

```
sudo nvidia-smi -pm 0; sudo nvidia-smi -pm 1; sudo nvidia-smi boost-slider --vboost 4
```

The boost-slider option tunes the GPU clocks and can give a slight perf increase; for B200 min-latency scenarios it is worth about 8 TPS/user. This step is not required; it is provided here so that the perf numbers in this doc can be reproduced more closely to our internal runs.

#### 6. Dataset preparation

The trtllm-bench tool requires a dataset file from which it reads the prompt and output sequence length of each request. Format details of this dataset file can be seen in [preparing a dataset]( https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html#preparing-a-dataset). For min-latency benchmarking, **a real dataset is required**, since the MTP acceptance rate is affected by the dataset and therefore affects performance. You can use your own dataset following the format described in the link above. For max-throughput benchmarking, a synthetic dataset is representative enough, since MTP is not used; the command to generate a synthetic dataset is given in the max-throughput section.

## Reproducing steps

This section provides the steps to reproduce the numbers on NVIDIA Blackwell B200 and Hopper H200 GPUs, for both min-latency and max-throughput scenarios. All the benchmarking is done with the trtllm-bench command line tool provided in the TensorRT LLM installation; see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool. For brevity, we only provide the commands to reproduce the perf numbers without a detailed explanation of the tools and options in this doc. All commands here are assumed to run inside the container started by the `make -C docker run ...` command mentioned in the [Build and run TensorRT LLM container section](#3-build-and-run-tensorrt-llm-container).

### B200 min-latency

Our benchmark results are based on **Batch = 1, ISL = 1K, OSL = 2K, num_requests = 10 from a real dataset**. To do the benchmark, write the extra LLM API options to a YAML file (`config.yml`) and pass it to `trtllm-bench`:
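The exact tuned `config.yml` contents and `trtllm-bench` invocation depend on the scenario; the snippet below is only a minimal, hypothetical sketch of the pattern — write extra LLM API options to a YAML file, then point `trtllm-bench` at the model, the dataset, and that file. The YAML keys and values shown (`cuda_graph_config`, an MTP `speculative_config`, `kv_cache_config`) and the flag values are illustrative placeholders, not the tuned settings behind the published results.

```bash
# Illustrative sketch only -- not the tuned configuration from this blog.
YOUR_DATA_PATH=<path to your real dataset file>

cat > ./config.yml <<EOF
# Placeholder extra LLM API options; tune these for your own runs.
cuda_graph_config: {}
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.85
EOF

trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
  --dataset ${YOUR_DATA_PATH} \
  --backend pytorch \
  --extra_llm_api_options ./config.yml \
  --concurrency 1 \
  --num_requests 10
```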
---

Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. TensorRT LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ | BS: (in order) 256, 128

**Model Accuracy:** Quantization can often have adverse impacts on the accuracy of the model; however, TensorRT LLM's AWQ decreases the memory footprint of the model by **4x** while maintaining high accuracy.

Falcon-180B accuracy comparison

Preliminary measured accuracy, subject to change. TensorRT LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ

[**INT4 Activation-aware Weight Quantization (AWQ)**](https://arxiv.org/abs/2306.00978) (Lin et al., 2023) is a quantization technique which compresses the weights of an LLM down to 4 bits based on their relative importance and performs computation in FP16. This allows AWQ to retain higher accuracy than other 4-bit methods and reduce memory usage, but it requires special kernels capable of handling the change in precision performantly. TensorRT LLM has implemented custom kernels for AWQ, and has taken the technique a step further by performing FP8 computation on Hopper GPUs instead of the standard FP16.

Similar examples running Falcon-180B with quantization in TensorRT LLM are available in [examples/models/contrib/falcon](/examples/models/contrib/falcon).

## Llama-70B on H200 up to 6.7x A100

TensorRT LLM has improved its Grouped Query Attention (GQA) kernels in the generation phase, providing up to a 2.4x improvement on Llama-70B over TensorRT LLM v0.5 and achieving over **3,800** tok/s/GPU, up to **6.7x** faster than A100.

**H200 6.7x A100**

Llama-70B H200 vs A100 comparison

|Model |GPUs | Input Length | Output Length | Throughput (out tok/s/GPU)|
|:---------|:----|:-------------|:--------------|:------|
|Llama-70B | 1 | 128| 128 | 3,803 |
| | 8 | | | 3,803 |
| | 1 | | 2048 | 2,941 |
| | 8 | | | 3,163 |
| | 1 | | 4096 | 1,946 |
| | 8 | | | 2,263 |

Preliminary measured performance, subject to change. TensorRT LLM v0.7a | Llama2-70B | 1xH200 = TP1, 8xH200 = max TP/PP/DP config | FP8 | BS: (in order) 960, 960, 192, 560, 96, 640

**TensorRT LLM GQA now 2.4x faster on H200**

Llama-70B H200 December vs Oct.

Preliminary measured performance, subject to change. TensorRT LLM v0.7a vs TensorRT LLM v0.6a | Llama2-70B | 1xH200 TP1 | FP8 | BS 192

[**Grouped Query Attention (GQA)**](https://arxiv.org/abs/2305.13245v2) (Ainslie et al., 2023), used in Llama-70B, is a variant of Multihead Attention (MHA) which groups key-value (KV) heads together, resulting in fewer KV heads than query (Q) heads. TensorRT LLM has a custom implementation of MHA which supports GQA, multi-query attention (MQA) and standard MHA. It leverages Tensor Cores, including in the generation phase, and delivers great performance on NVIDIA GPUs.

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.7 & v0.8 releases. Similar examples running Llama-70B in TensorRT LLM are published in [examples/models/core/llama](/examples/models/core/llama). For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU: `out_tps=output_seqlen*batch_size/total_latency/tp`

**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel

---

> :bangbang: :new: *NVIDIA H200 has been announced & is optimized on TensorRT LLM. Learn more about H200, & H100 comparison, here:* [**H200** achieves nearly **12,000 tokens/sec on Llama2-13B** with TensorRT LLM](./H200launch.md)

# H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token

TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100ms.
For min-latency applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency.

max throughput 1st token latency

TensorRT LLM throughput & first token latency on H100 & A100. H100 FP8, A100 FP16, SXM 80GB GPUs, ISL/OSL's provided, TP=1, BS=32/64 max throughput, BS=1 1st token latency. TensorRT LLM v0.5.0, TensorRT 9.1. Max throughput calculated by sweeping BS 1,2,...,64. Throughput taken at the largest successful batch size.

**Max Throughput & Min Latency**

| Model | Batch Size | Input Length | Output Length | Throughput (out tok/s) | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :----------- | :------------ | ---------------------: | ---------------------: |
| **H100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **10,907** | 102 |
| GPT-J 6B | 1 | 128 | - | 185 | **7.1** |
| **A100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | 3,679 | 481 |
| GPT-J 6B | 1 | 128 | - | 111 | 12.5 |
| **Speedup** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **3.0x** | **4.7x** |
| GPT-J 6B | 1 | 128 | - | **2.4x** | 1.7x |

FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0, TensorRT 9.1

The full data behind these charts & tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)

Stay tuned for a highlight on Llama coming soon!

## MLPerf on H100 with FP8

In the most recent MLPerf results, NVIDIA demonstrated up to 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, the H100 showed a 2x increase over the A100. Switching to FP8 resulted in yet another 2x increase in speed.

## What is H100 FP8?

H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. Providing native support for FP8 data types, H100 can double performance and halve memory consumption compared to 16-bit floating point options on H100.

The FP8 specification, introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433), can be used to speed up training as well as inference with post-training quantization of models trained using 16-bit formats. The specification consists of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors. In practice, FP8 can improve perceived performance on H100 (FP8 vs FP16) by more than 2x.

FP8 is a W8A8 format, meaning both the weights and the activations (the compute) are stored in 8 bits. 8-bit weights decrease GPU memory consumption & bandwidth, meaning a larger model, sequence length, or batch size can fit on the same GPU. This can enable new use cases, and a larger max batch size can increase max throughput beyond 2x of FP16 H100.

---

:loudspeaker: Note: The data below uses TensorRT LLM v0.5. There have been significant improvements in v0.6 & later. Please see updated Llama performance [here](./Falcon180B-H200.md).
# H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform) achieves **11,819 tokens/s on Llama2-13B** on a single GPU. H200 is up to **1.9x faster** than H100. This performance is enabled by H200's larger, faster [HBM3e memory](#latest-hbm-memory). **H200 FP8 Max throughput** |Model | Batch Size(1) | TP(2) | Input Length | Output Length | Throughput (out tok/s/GPU) | |:----------|:-------------------------|:-----------------|:-------------|:--------------|---------------------------:| | llama_13b | 1024 | 1 | 128 | 128 | 11,819 | | llama_13b | 128 | 1 | 128 | 2048 | 4,750 | | llama_13b | 64 | 1 | 2048 | 128 | 1,349 | | llama_70b | 512 | 1 | 128 | 128 | 3,014 | | llama_70b | 512 | 2 | 128 | 2048 | 1,654 | | llama_70b | 64 | 1 | 2048 | 128 | 341 | | llama_70b | 32 | 1 | 2048 | 128 | 303 | Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8. *(1) Largest batch supported on given TP configuration by power of 2.* *(2) TP = Tensor Parallelism* Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html). ### H200 vs H100 H200's HBM3e larger capacity & faster memory enables up to **1.9x** performance on LLMs compared to H100. Max throughput improves due to its dependence on memory capacity and bandwidth, benefitting from the new HBM3e. First token latency is compute bound for most ISLs, meaning H200 retains similar time to first token as H100. For practical examples of H200's performance: **Max Throughput TP1:** an offline summarization scenario (ISL/OSL=2048/128) with Llama-70B on a single H200 is 1.9x more performant than H100. **Max Throughput TP8:** an online chat agent scenario (ISL/OSL=80/200) with GPT3-175B on a full HGX (TP8) H200 is 1.6x more performant than H100. H200 TPS Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4. | Llama-70B: H100 FP8 BS 8, H200 FP8 BS 32 | GPT3-175B: H100 FP8 BS 64, H200 FP8 BS 128 **Max Throughput across TP/BS:** Max throughput(3) on H200 vs H100 varies by model, sequence lengths, BS, and TP. Below results shown for maximum throughput per GPU across all these variables. max throughput llama sweep Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8. *(3) Max Throughput per GPU is defined as the highest tok/s per GPU, swept across TP configurations & BS powers of 2.* ### Latest HBM Memory H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity nearly 2X to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates performance of LLM model inference performance with faster throughput and tokens per second. These results are measured and preliminary, more updates expected as optimizations for H200 continue with TensorRT LLM. 
---

# New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

The XQA kernel provides optimizations for [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/abs/2305.13245v3) during the generation phase, and also provides an optimization for beam search. By using Tensor Cores for acceleration and reducing data loading and conversion, it delivers increased throughput within the same latency budget. The increased throughput allows a greater number of user requests to be served while providing the same experience. The support matrix and usage flags are described in [docs/source/advanced/gpt_attention](/docs/source/advanced/gpt-attention.md#xqa-optimization).

**Increased Throughput:** Looking at the Throughput-Latency curves below, we see that enabling the XQA optimization increases throughput. Higher throughput equates to serving more users, and the TPOT on the Y-axis flattens out when XQA is enabled.

XQA increased throughput within same latency budget

Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a

## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget

**H200 2.4x with XQA**

|Model |GPUs | Input Length | Output Length | Throughput w/o XQA (tok/s/GPU) | Throughput w/ XQA (tok/s/GPU) | Speedup |
|:---------|:----|:-------------|:--------------|:-------------------|:------------------|:--------|
|Llama-70B | 1 | 128 | 2048 | 1,227 | 2,941 | 2.4x |
| | 8 | 128 | 2048 | 13,232 | 25,300 | 1.9x |

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.8 release. For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU: `out_tps=output_seqlen*batch_size/total_latency/tp`

**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel

---

# Speed up inference with SOTA quantization techniques in TRT-LLM

The deployment and inference speed of LLMs are often impeded by limitations in memory capacity, memory bandwidth, and computation power. Quantization, which represents weights and activations with lower-precision data types like [FP8](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/), emerges as a vital strategy to address these bottlenecks. In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmarks, and offer best practices for selecting the appropriate quantization methods tailored to your specific use case.

## Quantization in TensorRT-LLM

TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speed up DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with ease of use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of code. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will expand to more model optimization techniques in the near future.

## Benchmark

### Performance

In the following benchmark, we highlight the acceleration of a few popular models at a small batch size without imposing latency constraints.
It's important to note that in scenarios where there's a latency constraint in your application, TRT-LLM can achieve an even greater performance improvement. Using LLaMA-v2-7B as an example, when the first token latency is constrained to be under 500ms, quantization with FP8 and a batch size of 16 achieves a notable **2.3x inference speedup** compared to FP16 on a H100. | Model | Batch Size | Speedup (FP8 v.s. FP16) | Speedup (INT8 SQ v.s. FP16) | | ----------- | :--------: | :---------------------: | :-------------------------: | | GPT-J | 1 | 1.40x | 1.40x | | GPT-J | 8 | 1.44x | 1.30x | | LLaMA-v2-7B | 1 | 1.51x | 1.47x | | LLaMA-v2-7B | 8 | 1.40x | 1.32x | *The above benchmarks were run with Input Length=1024, Output Length=128, and TP=1 on H100 80GB. ### Accuracy | Model | Quantization Methods | MMLU Baseline (FP16) | MMLU Post-quantization | MMLU Loss | | ------------ | :------------------: | :------------------: | :--------------------: | :-------: | | Falcon-180B | FP8 | 70.4 | 70.3 | 0.14% | | | INT8-SQ | 70.4 | 68.6 | 2.56% | | | INT4-AWQ | 70.4 | 69.8 | 0.85% | | Falcon-40B | FP8 | 56.1 | 55.6 | 0.89% | | | INT8-SQ | 56.1 | 54.7 | 2.50% | | | INT4-AWQ | 56.1 | 55.5 | 1.07% | | LLaMA-v2-70B | FP8 | 69.1 | 68.5 | 0.87% | | | INT8-SQ | 69.1 | 67.2 | 2.75% | | | INT4-AWQ | 69.1 | 68.4 | 1.01% | | MPT-30B | FP8 | 47.5 | 47.4 | 0.21% | | | INT8-SQ | 47.5 | 46.8 | 1.47% | | | INT4-AWQ | 47.5 | 46.5 | 2.11% | ## Best practices to choose the right quantization methods A quantization method comprises three primary components: 1. Weight precision format 2. Activation precision format 3. Calibration algorithms Typically, in the context of small-batch inference scenarios (batch size ≤ 4), the key consideration is memory bandwidth, making weight-only quantization methods the preferred choice. Conversely, for large-batch inference scenarios, such as serving scenarios (batch size ≥ 16), both memory bandwidth and computation density become crucial factors. Consequently, it's recommended to opt for a quantization method that has both weight and activation quantized. For batch size ≥ 16, the choice of quantization method can be model specific. We suggest to prioritize using FP8 first, as we typically see it offers the best performance and accuracy. If the results do not meet your specific use case, you can further experiment with Int8 SmoothQuant (Int8 SQ) followed by AWQ and/or GPTQ. Based on specific use cases, users might have different tolerances on accuracy impact and calibration time. The table below summarizes the tradeoffs* to consider when choosing a quantization method. You can also learn more about precision formats in our [documentation](https://nvidia.github.io/TensorRT-LLM/reference/precision.html). | Quantization Methods | Performance Improvement (batch size <= 4) | Performance Improvement (batch size >= 16) | Accuracy Impact | Calibration Time** | | :----------------------- | :---------------------------------------: | :----------------------------------------: | :-------------: | :----------------: | | FP8 (W8A8) | Medium | Medium | Very Low | Minutes | | Int8 SQ (W8A8) | Medium | Medium | Medium | Minutes | | Int8 weight-only (W8A16) | Medium | Low | Low | Not Required | | Int4 weight-only (W4A16) | High | Low | High | Not Required | | Int4 AWQ (W4A16) | High | Low | Low | Tens of Minutes | | Int4 GPTQ | High | Low | Low | Tens of Minutes | | Int4-FP8 AWQ (W4A8) | High | Medium | Low | Tens of Minutes | \* The performance and impact are measured on 10+ popular LLMs. 
We'll follow up with more data points. ** Calibration time is subject to the actual model size. We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit. ## What’s coming next TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases. --- # ADP Balance Strategy By NVIDIA TensorRT LLM team ## Table of Contents - [ADP Balance Strategy](#adp-balance-strategy) - [Table of Contents](#table-of-contents) - [Motivation and Background](#motivation-and-background) - [Theoretical Analysis and Modeling](#theoretical-analysis-and-modeling) - [Mathematical Modeling](#mathematical-modeling) - [Scheduling Strategies for Load Balancing](#scheduling-strategies-for-load-balancing) - [Baseline: Round-Robin Token Distribution](#baseline-round-robin-token-distribution) - [ADP Balance Strategy: Coordinated Waiting Mechanism](#adp-balance-strategy-coordinated-waiting-mechanism) - [Performance Analysis: Baseline vs. ADP Balance](#performance-analysis-baseline-vs-adp-balance) - [Experiments](#experiments) - [Setting](#setting) - [Dataset Configuration](#dataset-configuration) - [Hardware and Model Configuration](#hardware-and-model-configuration) - [Performance Results](#performance-results) - [Performance Summary](#performance-summary) - [Baseline Performance](#baseline-performance) - [ADP Balance with Context Wait Implementation](#adp-balance-with-context-wait-implementation) - [ADP Balance with Full Strategy Implementation](#adp-balance-with-full-strategy-implementation) - [Pareto Analysis: Throughput-Latency Trade-off Optimization](#pareto-analysis-throughput-latency-trade-off-optimization) - [Conclusion](#conclusion) - [Acknowledgement](#acknowledgement) ## Motivation and Background In DeepSeek MLA + MoE architectures under maximum-throughput scenarios, an Attention Data Parallel (ADP) + MoE Expert Parallel (EP) strategy is commonly employed to eliminate redundant KV cache storage, and utilize disaggregated serving to prevent ADP imbalances. However, certain deployment scenarios still favor In-Flight Batching (IFB) inference, including: - **System complexity reduction**: Avoiding the operational overhead and maintenance costs associated with disaggregated architectures - **Specific workload patterns**: Scenarios with short input sequence lengths (ISL) and long output sequence lengths (OSL) - **Offline inference**: Batch processing environments where Time-To-First-Token (TTFT) and Time-To-Output-Token (TPOT) requirements are more relaxed However, IFB introduces significant load imbalance challenges within the Attention module that severely impact system performance. The core issue arises when different ranks simultaneously handle heterogeneous workloads within the same iteration. 
For instance, some ranks may be processing computationally intensive context phases while others execute generation phases, creating substantial disparities in token processing loads. This bottlenecks the overall system throughput, as the iteration time is dominated by the slowest rank. To address this critical performance limitation, we introduce the **ADP (Attention Data Parallel) Balance Strategy**—a novel scheduling optimization designed to achieve optimal load distribution across DP ranks and maximize system utilization. ## Theoretical Analysis and Modeling **Optimization Objective**: Minimize load imbalance across different GPU ranks to maximize overall system throughput. ### Mathematical Modeling We model and quantify the performance impact of load imbalance in Attention DP. Since workloads across ranks can be heterogeneous, the execution time for the Attention module in any given iteration is bounded by the rank with the highest workload: $$ time_i = \max_{0 \leq m < N} time_{i,m} $$ where $time_{i,m}$ represents the execution time of rank $m$ in iteration $i$, and $N$ is the data parallel size. To quantify load balance and theoretical performance bounds, we define two key metrics: #### 1. Balance Ratio The balance ratio measures the load distribution across ranks within the Attention module for each iteration: $$ balance = \frac{tokens_{avg}}{tokens_{max}} $$ where: - $tokens_{avg}$ represents the average number of tokens across all ranks - $tokens_{max}$ represents the maximum number of tokens across all ranks - $tokens_i$ represents the number of tokens processed by rank $i$ Note: MoE module load balancing is handled separately by the Expert Parallel Load Balancer (EPLB) module and is not considered during the early scheduling phase. #### 2. Speed-of-Light Throughput (SOL TPS) The Speed-of-Light throughput represents the theoretical upper-bound throughput achievable with perfect load balancing: $$ time_{sol} = \sum_{i=0}^{\infty} time_i \times balance $$ $$ tps_{sol} = \frac{time_{elapsed}}{time_{sol}} \times tps_{actual} $$ where: - $time_i$: Measured execution time of iteration $i$ - $time_{elapsed}$: Total empirically measured end-to-end execution time - $tps_{actual}$: Observed throughput in tokens per second - $tps_{sol}$: Theoretical maximum throughput under perfect load balance This theoretical framework enables us to quantify the performance gap between current and optimal system utilization, providing clear targets for optimization. ### Scheduling Strategies for Load Balancing The fundamental challenge in Attention DP is that ranks can process vastly different token loads within the same iteration, causing the overall execution time to be bottlenecked by the most heavily loaded rank. #### Baseline: Round-Robin Token Distribution The conventional approach employs a global load balancing strategy that sorts incoming requests by `num_tokens` and distributes them across ranks using round-robin scheduling, as illustrated in Figure 1. This method achieves reasonable token distribution from a cumulative perspective and effectively reduces token count disparities when all ranks are simultaneously processing context requests.

Figure 1: Baseline round-robin strategy balances context request tokens across ranks through sorting and cyclic distribution
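To make the baseline concrete, here is a small, hypothetical Python sketch of the sort-and-round-robin assignment described above, together with the per-iteration balance ratio defined earlier. The function and variable names are illustrative only and are not TensorRT LLM APIs.

```python
from typing import Dict, List

def round_robin_schedule(request_tokens: List[int], num_ranks: int) -> Dict[int, List[int]]:
    """Sort requests by token count, then deal them out cyclically across ranks."""
    assignment: Dict[int, List[int]] = {rank: [] for rank in range(num_ranks)}
    for i, tokens in enumerate(sorted(request_tokens, reverse=True)):
        assignment[i % num_ranks].append(tokens)
    return assignment

def balance_ratio(tokens_per_rank: List[int]) -> float:
    """balance = tokens_avg / tokens_max for one iteration."""
    tokens_avg = sum(tokens_per_rank) / len(tokens_per_rank)
    tokens_max = max(tokens_per_rank)
    return tokens_avg / tokens_max

# Example: 8 context requests distributed over 4 ranks.
assignment = round_robin_schedule([900, 850, 700, 640, 300, 220, 180, 90], num_ranks=4)
per_rank_tokens = [sum(reqs) for reqs in assignment.values()]
print(per_rank_tokens, f"balance={balance_ratio(per_rank_tokens):.2f}")
```

As the limitations discussed next point out, this only balances the cumulative token count; it says nothing about which iteration each request's context phase lands in.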

**Limitations**: While effective globally, this approach fails to guarantee per-iteration load balance. A critical scenario arises when some ranks process context phases, while others handle generation (decode), creating severe load imbalances that dominate overall execution time. #### ADP Balance Strategy: Coordinated Waiting Mechanism To address the per-iteration load imbalance problem, we propose the **ADP Balance Strategy**, which employs a sophisticated waiting mechanism to synchronize context processing across ranks. The core principle is strategic delay: instead of immediately scheduling context requests to available ranks, the system waits strategically to ensure multiple ranks have similar workloads before proceeding. **Algorithm Design**: The strategy introduces two complementary control parameters: **1. Context Synchronization (`timeout_iters`)** - **Purpose**: Ensures temporal alignment of context processing across ranks - **Mechanism**: When a rank becomes available for context processing while others remain in generation phases, it waits up to `timeout_iters` iterations until all other ranks have context tasks - **Benefit**: Prevents the scenario where one rank processes context tasks while others handle generation tasks **2. Batch Equilibration (`batching_wait_iters`)** - **Purpose**: Balances the number of accumulated context batches across ranks - **Mechanism**: After initial synchronization, ranks with fewer accumulated context batches wait up to `batching_wait_iters` additional iterations to accumulate more batches - **Benefit**: Prevents load imbalances caused by uneven context batch accumulation, where some ranks may have multiple batches while others have only one ### Performance Analysis: Baseline vs. ADP Balance To illustrate the effectiveness of our approach, consider a simplified scenario where: - All ranks have equal-length contexts and M ongoing requests - N new requests arrive sequentially over N iterations. - Context processing time: `time(ctx)` >> Generation processing time: `time(gen)` **Baseline Behavior:** In the traditional approach, contexts are processed sequentially across ranks, resulting in severe load imbalances: ```text iter_i: [*C0*, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] iter_i+1: [g00, g01, ..., g0M], [*C1*, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] ... iter_i+N-1: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [*CN*, gN1, ..., gNM] ``` *Legend: `*Ci*` = context request i, `gij` = generation request j on rank i* - **Per-iteration time**: `time(ctx)` (dominated by context processing) - **Total execution time**: `time(ctx) × N` - **Balance ratio**: `(ctx_len + (M-1) + M × (N-1)) / (N × ctx_len)` (poor balance) **ADP Balance Strategy:** Our method synchronizes context processing by strategic waiting: ```text iter_i: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] iter_i+1: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] ... 
iter_i+N-1: [*C0*, g01, ..., g0M], [*C1*, g11, ..., g1M], ..., [*CN*, gN1, ..., gNM] ``` - **Per-iteration time**: `time(gen)` for first N-1 iterations, `time(ctx)` for final iteration - **Total execution time**: `time(gen) × (N-1) + time(ctx)` - **Balance ratio**: 1.0 (perfect balance) - **Time savings**: `(time(ctx) - time(gen)) × (N-1)` **Trade-offs:** - ✅ **Throughput improvement** due to optimal load balancing - ✅ **Maximized GPU utilization** across all ranks - ⚠️ **Increased TTFT** due to strategic waiting mechanism - 📋 **Best suited for** throughput-oriented scenarios where TTFT is not critical ## Experiments ### Setting #### Dataset Configuration We evaluate our approach using a comprehensive dataset comprising 16,000 inference requests with the following characteristics: - **Request volume**: 16,000 total requests - **Average input length**: 803 tokens - **Average output length**: 3,653 tokens - **Token distribution**: Figure 2 illustrates the distribution patterns for both input and output sequences

Figure 2: Distribution of input and output token lengths

**Dataset Characteristics**: The dataset exhibits significant diversity in sequence lengths, with output tokens following a pronounced long-tail distribution. This heterogeneity presents substantial challenges for load balancing, as it becomes difficult to co-schedule multiple context requests within the same iteration while minimizing computational bubbles—making it an ideal testbed for evaluating our scheduling strategy. #### Hardware and Model Configuration **Infrastructure**: - **Platform**: NVIDIA Blackwell GB200 system - **GPU Count**: 8 × GB200 GPUs - **Model**: DeepSeek V3 - **Parallelization Strategy**: - Attention module: Data Parallel (DP) size = 8 - MoE module: Expert Parallel (EP) size = 8 ### Performance Results We evaluate three distinct configurations to demonstrate the progressive benefits of our ADP Balance strategy: 1. **Baseline**: Round-robin scheduling 2. **ADP Balance (Context Wait)**: Implementing `timeout_iters` parameter only 3. **ADP Balance (Full Strategy)**: Complete implementation with both `timeout_iters` and `batching_wait_iters` #### Performance Summary | Configuration | Actual TPS | Avg Balance Ratio | SOL TPS | Speedup | |---------------|------------|-------------------|-------------------|---------| | Baseline | 25,664 | 54.11% | 39,552 | 1.00× | | ADP Balance (Context Wait) | 33,499 | 84.33% | 38,312 | 1.31× | | ADP Balance (Full Strategy) | 34,140 | 87.70% | 37,912 | 1.33× | **Key Observations**: - Context Wait alone delivers a substantial **31% throughput improvement** - Full strategy achieves **33% total speedup** with near-optimal balance (87.70%) - Balance ratio improvement: **54% → 87%** represents a dramatic reduction in load imbalance *Note: The decrease in SOL TPS with waiting mechanisms occurs because the strategic delays in context scheduling increase the total number of iterations required to complete all requests. Since SOL TPS calculation only accounts for load imbalance effects within each iteration, it doesn't reflect the iteration count increase caused by delayed context entry, leading to an apparent reduction despite overall system efficiency improvements.* #### Baseline Performance Figure 3 provides comprehensive insight into baseline system behavior, displaying both average tokens across ranks (top) and the corresponding balance ratio (bottom) by iteration. The balance ratio serves as a key indicator: values approaching 1.0 represent optimal balance, while values near 0.0 indicate severe imbalances.

Figure 3: Baseline performance overview showing token distribution and balance ratios across all iterations

**Critical Insights**: - **Imbalance window**: Most severe imbalances occur within the first 12,000 iterations, as evidenced by the average token distribution showing that all context processing phases occur within this critical interval - **Performance gap**: SOL TPS of 39,552 vs. actual TPS of 25,664 reveals a **54% relative performance gap** - **System behavior**: After iteration 12,000, all requests transition to generation phase, naturally reducing imbalances Figure 4 zooms into the critical imbalance period [100-12,000], revealing the dramatic instability in load distribution:

Figure 4: Detailed baseline analysis for iterations 100-12,000 showing severe balance fluctuations

**Performance Bottlenecks**: - Balance ratio frequently drops to **0.4 or lower**, indicating 60%+ load imbalance - Theoretical improvement potential of **70.23%** within the critical window - Extreme volatility in load distribution creates unpredictable performance characteristics #### ADP Balance with Context Wait Implementation The Context Wait mechanism (`timeout_iters=50`) demonstrates the effectiveness of our first optimization component, achieving substantial performance improvements through context synchronization. **Performance Achievements**: - **Throughput**: 33,499 TPS (1.31× speedup) - **Balance improvement**: 84.33% average (vs. 54.11% baseline) - **Efficiency**: Actual TPS significantly closer to theoretical SOL TPS (38,312)

Figure 5: Context Wait performance showing improved balance stability for iterations 100-12,000

**Remaining Challenges**: Despite significant improvements, residual imbalances persist due to: 1. **Timeout scenarios**: Some ranks exceed the waiting threshold when context requests don't arrive uniformly 2. **Batch accumulation disparity**: Longer-waiting ranks accumulate multiple context batches while recently-freed ranks process single batches 3. **Partial synchronization**: While initial synchronization succeeds, subsequent load variations still occur This analysis motivated the development of our second optimization component: batch equilibration. #### ADP Balance with Full Strategy Implementation The complete ADP Balance strategy combines both context synchronization and batch equilibration mechanisms, delivering optimal load balancing performance. **Configuration**: `timeout_iters=50` + `batching_wait_iters=10` **Performance Optimization Results**: - **Peak throughput**: 34,140 TPS (1.33× speedup) - **Optimal balance**: 87.70% average balance ratio - **Near-theoretical efficiency**: Actual TPS (34,140) approaches SOL TPS (37,912) - **System stability**: Dramatically reduced load variance across iterations The effectiveness of our complete ADP Balance implementation is clearly demonstrated in Figure 6. The visualization reveals how the combination of context synchronization and batch equilibration mechanisms achieves near-optimal load balancing throughout the critical execution window.

Figure 6: Full ADP Balance strategy demonstrating superior balance stability for iterations 100-12,000

**Key Improvements Over Context Wait**: - **Enhanced stability**: Balance ratio maintains consistently higher values with reduced volatility - **Residual imbalance mitigation**: Batch equilibration addresses the remaining load disparities - **System predictability**: More uniform performance characteristics across iterations **Implementation Trade-offs**: - ✅ **Maximum throughput improvement**: 33% gain over baseline - ✅ **Near-optimal load balancing**: 87.70% average balance ratio - ⚠️ **Iteration overhead**: Waiting mechanisms increase total iteration count - ⚠️ **TTFT impact**: Strategic delays affect time-to-first-token metrics **Production Configuration**: Users can enable the full ADP Balance strategy by adding the following configuration: ```yaml attention_dp_config: enable_balance: true batching_wait_iters: 10 timeout_iters: 50 ``` ### Pareto Analysis: Throughput-Latency Trade-off Optimization Understanding the performance trade-offs inherent in our ADP Balance strategy is crucial for production deployment decisions. Figure 7 presents a comprehensive Pareto frontier analysis that maps the relationship between system throughput (TPS per GPU) and Time-To-First-Token (TTFT) across varying workload intensities and parameter configurations. **Experimental Design**: The analysis evaluates multiple configurations of `timeout_iters` (TO) and `batching_wait_iters` (BW) parameters under different system load conditions, revealing how parameter tuning affects the fundamental throughput-latency trade-off.

Figure 7: Pareto frontier analysis showing throughput-latency trade-offs across different ADP Balance configurations

**Key Findings**: 1. **Universal Throughput Gains**: ADP Balance consistently delivers superior TPS/GPU performance across the entire operational spectrum, from latency-sensitive to throughput-maximized deployments 2. **Scalability Benefits**: Performance improvements become increasingly pronounced under higher system loads, where load imbalance penalties are most severe 3. **TTFT Trade-off**: Throughput gains necessitate increased first-token latency due to the strategic waiting mechanisms, with higher parameter values yielding greater throughput at the cost of longer response initiation 4. **Configuration Guidance**: - **Low-load scenarios**: `batching_wait_iters` provides minimal benefit while adding latency overhead - **High-throughput scenarios**: Both parameters contribute significantly to performance optimization - **Balanced deployments**: Moderate parameter values offer optimal throughput-latency balance **Production Implications**: This analysis empowers system operators to make data-driven configuration decisions based on specific deployment requirements—whether optimizing for minimum response latency or maximum system throughput. ## Conclusion Load imbalance in Attention Data Parallel processing represents a fundamental bottleneck in large language model inference systems, particularly under In-Flight Batching scenarios where heterogeneous workloads create severe performance penalties. This work introduces the **ADP Balance Strategy**—a sophisticated scheduling optimization that addresses this critical challenge through coordinated waiting mechanisms. **Technical Contributions**: Our approach employs two complementary optimization components: context synchronization (`timeout_iters`) and batch equilibration (`batching_wait_iters`). These mechanisms work in concert to achieve temporal alignment of computationally intensive context processing across data parallel ranks, effectively eliminating the performance bottlenecks caused by rank-level load imbalances. **Experimental Validation**: Comprehensive evaluation on the DeepSeek V3 architecture demonstrates compelling performance improvements: - **33% throughput increase**: From 25,664 to 34,140 TPS - **87% load balance achievement**: Dramatic improvement from 54% baseline - **Near-theoretical efficiency**: Actual performance approaching speed-of-light throughput bounds **Production Readiness**: The Pareto frontier analysis provides critical insights for real-world deployment, revealing that while the strategy introduces TTFT trade-offs, it consistently delivers superior throughput across diverse operational scenarios. The configurable parameter framework enables operators to optimize for their specific performance requirements, whether prioritizing response latency or system throughput. ## Acknowledgement The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications. 
--- ## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM) This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content. ### Prerequisites - NVIDIA GB200 or B200 GPUs (example below assumes 8 GPUs; adjust flags for your setup) - Fast SSD storage for model weights - Base model weights available under a directory named `gpt-oss-120b` (example path) - Eagle3 speculative model assets available under a directory named `eagle` Expected directory layout on the host (example): ``` /path/to/models/ ├─ gpt-oss-120b/ # base model directory └─ eagle/ # Eagle3 speculative decoding assets ``` ### Get the TensorRT LLM Container (1.1.0rc0) If required by your environment, log into NGC and pull the image: ```bash # Create an API key at https://ngc.nvidia.com (if you don't have one) docker login nvcr.io # Username: $oauthtoken # Password: docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 ``` ### Start the TensorRT LLM Container Run the container and bind-mount your models directory to `/config/models` inside the container: ```bash docker run --rm --ipc=host -it \ --ulimit stack=67108864 \ --ulimit memlock=-1 \ --gpus all \ -p 8000:8000 \ -v /path/to/models:/config/models:rw \ nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \ /bin/bash ``` Replace `/path/to/models` with the absolute path on your host. ### Download the models (Base + Eagle3) Inside the container, download the base model and the Eagle3 speculative model to the expected directories under `/config/models/`: ```bash # Optional: authenticate if the repository requires it # export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential pip install -q "huggingface_hub[cli]" # Base model: openai/gpt-oss-120b huggingface-cli download openai/gpt-oss-120b \ --local-dir /config/models/gpt-oss-120b \ --repo-type model # Eagle3 model assets mkdir -p /config/models/eagle huggingface-cli download nvidia/gpt-oss-120b-Eagle3 \ --local-dir /config/models/eagle \ --repo-type model ``` References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3` ### Create the Eagle3 Configuration Inside the container, create the YAML file at `/config/models/eagle/eagle.yaml` with the following content: ```bash mkdir -p /config/models/eagle cat > /config/models/eagle/eagle.yaml << 'EOF' trust_remote_code: true kv_cache_config: enable_block_reuse: false free_gpu_memory_fraction: 0.8 speculative_config: decoding_type: Eagle max_draft_len: 3 speculative_model_dir: /config/models/eagle/ cuda_graph_config: max_batch_size: 10 use_torch_sampler: true moe_config: backend: TRTLLM EOF ``` Notes: - Ensure your base model directory is `/config/models/gpt-oss-120b`. - Ensure your Eagle3 assets are present under `/config/models/eagle/`. - If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`. ### Launch the Server (Eagle3 Speculative Decoding) Run the following command inside the container to start the endpoint: ```bash TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --config /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072 ``` The server initializes, loads, and optimizes the models. 
After it is ready, it listens on port 8000. ### Quick Health Check From another terminal on the host, verify that the server is healthy: ```bash curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health" ``` When `Status: 200` is returned, the endpoint is ready to serve requests. ### Sample Chat Completions Request Note: This Eagle3 + TensorRT LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`. Send a simple OpenAI-compatible Chat Completions request to the running server: ```bash curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ -d '{ "model": "gpt-oss-120b", "messages": [ {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."} ], "max_tokens": 128, "stream": false }' ``` --- # Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly *By NVIDIA TensorRT LLM Team and the XGrammar Team* ## Table of Contents - [Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly](#combining-guided-decoding-and-speculative-decoding-making-cpu-and-gpu-cooperate-seamlessly) - [Table of Contents](#table-of-contents) - [Background and Challenges](#background-and-challenges) - [Motivation](#motivation) - [Guided Decoding](#guided-decoding) - [Speculative Decoding](#speculative-decoding) - [Two Challenges](#two-challenges) - [Trace Grammar State for Draft Token Proposal and Rejection](#trace-grammar-state-for-draft-token-proposal-and-rejection) - [Target Model](#target-model) - [Draft Model](#draft-model) - [Make Grammar Computation Capturable by CUDA Graph](#make-grammar-computation-capturable-by-cuda-graph) - [CUDA Callback](#cuda-callback) - [Integration to TensorRT LLM Python Runtime](#integration-to-tensorrt-llm-python-runtime) - [CUDA Graph Compatibility: Grammar Computation](#cuda-graph-compatibility-grammar-computation) - [CUDA Graph Compatibility: Mask Applying Kernel](#cuda-graph-compatibility-mask-applying-kernel) - [Troubleshooting: Data Race between Host and CUDA Callback](#troubleshooting-data-race-between-host-and-cuda-callback) - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#troubleshooting-deadlock-by-gil-and-cuda-mutex) - [Performance and Analysis](#performance-and-analysis) - [Acknowledgements](#acknowledgements) ## Background and Challenges ### Motivation As part of our effort to bridge gaps in feature combinations, we enabled guided decoding with many important LLM inference features in TensorRT LLM over the last two months: * Overlap scheduler: [PR 6000](https://github.com/NVIDIA/TensorRT-LLM/pull/6000) * CUDA graph padding: [PR 6774](https://github.com/NVIDIA/TensorRT-LLM/pull/6774) * Disaggregated serving: [PR 6704](https://github.com/NVIDIA/TensorRT-LLM/pull/6704) * Speculative decoding (two-model implementation): [PR 6300](https://github.com/NVIDIA/TensorRT-LLM/pull/6300) * Speculative decoding (one-model implementation): [PR 6948](https://github.com/NVIDIA/TensorRT-LLM/pull/6948) More complicated (higher-order) combinations are also supported; for example, we can run DeepSeek-R1 with guided decoding, overlap scheduler, CUDA graph, [attention data parallelism (ADP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md), [multiple token prediction 
(MTP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md) and [disaggregated serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)​ all enabled. Among all these tasks, combining guided decoding with one-model speculative decoding is the most challenging one, and it achieves the best performance for low-latency or throughput@latency scenarios. This blog post shares the overall design, implementation details, and performance analysis. ### Guided Decoding Guided decoding (or interchangeably constrained decoding, structured generation) guarantees that the LLM outputs are amenable to a user-specified grammar (e.g., JSON schema), which is particularly useful for LLM agents. For example, guided decoding can help an LLM generate function arguments that strictly conform to function signatures. Thus, the LLM can correctly call external tools and integrate the tool calling results for a better response. For a request at the prefill phase, guided decoding creates an initial grammar state (i.e., grammar compilation), and generates a mask tensor indicating which tokens from the vocabulary are allowed for the first generated token (i.e., mask gen). At each generation phase, guided decoding advances the grammar state based on the last generated token (i.e., grammar advance), and generates a mask tensor for the next token. The mask will be applied to the logits to mask out the disallowed tokens before sampling (i.e., mask applying), which ensures the next token is amenable to the grammar constraints. TensorRT LLM integrates third-party grammar backends (e.g., [XGrammar](https://github.com/mlc-ai/xgrammar), [LLGuidance](https://github.com/guidance-ai/llguidance)) for the grammar computation. Currently, these grammar backends are implemented on CPU, so the grammar computation introduces significant CPU overhead. Fortunately, this can be overlapped with the GPU computation, achieving [near-zero overhead](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). The core idea is that at every iteration, we should first launch the model forward to make the GPU busy, and then compute grammar compilation/advance and mask gen on CPU. Once both the computations finish, the mask can be applied to the logits before sampling.

Figure 1: Top: guided decoding timeline without overlapping. Bottom: guided decoding timeline with overlapping. (This figure is from the XGrammar paper.)

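To make the mask gen and mask applying steps above concrete, here is a minimal PyTorch sketch of a single guided decoding step. The `grammar` object (with `fill_next_token_mask` and `advance` methods) is a hypothetical stand-in for a CPU grammar backend such as XGrammar, and `model` is any decoder-only LM that returns logits; this illustrates the pattern rather than TensorRT LLM's actual implementation.

```python
import torch

def guided_decoding_step(model, input_ids, grammar, vocab_size):
    # 1. Launch the model forward first so the GPU stays busy (asynchronous GPU work).
    logits = model(input_ids)[:, -1, :]                      # (batch=1, vocab_size)

    # 2. Meanwhile, do the CPU grammar work: mask gen into a pinned host buffer.
    mask_cpu = torch.zeros(vocab_size, dtype=torch.bool, pin_memory=True)
    grammar.fill_next_token_mask(mask_cpu)                   # True = allowed by the grammar
    mask_gpu = mask_cpu.to("cuda", non_blocking=True)        # async H2D copy

    # 3. Mask applying: disallowed tokens get -inf logits, then sample (greedy here).
    logits = logits.masked_fill(~mask_gpu, float("-inf"))
    next_token = torch.argmax(logits, dim=-1)

    # 4. Grammar advance with the newly sampled token (forces a device sync here).
    grammar.advance(next_token.item())
    return next_token
```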
### Speculative Decoding

Speculative decoding is a crucial feature in low-latency or throughput@latency LLM inference scenarios. For each request, a lightweight drafter proposes several draft tokens, and then the target model verifies the draft tokens in parallel. Ideally, most draft tokens are accepted, so multiple tokens are generated in a single target model forward. Compared with normal LLM inference, where each model forward generates a single token, speculative decoding offers the potential to generate more tokens per iteration by leveraging more computation. This improves the arithmetic intensity and reduces the required number of iterations.

TensorRT LLM has two kinds of speculative decoding implementations, namely the one-model and two-model implementations. The one-model implementation launches a single CUDA graph for a target model forward together with multiple draft model forwards. This is more difficult to implement and is coupled with the modeling code, but it offers the best performance. The two-model implementation decouples the target and draft models into separate CUDA graphs, which is much more flexible and offers better feature coverage. There are ongoing efforts to close the gap between the two implementations.

Figure 2: Top: GPU timeline of one-model speculative decoding. Bottom: GPU timeline of two-model speculative decoding.

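As a refresher on the verification step, here is a minimal greedy-verification sketch. It assumes greedy sampling and a simple chain of draft tokens; the function and tensor names are illustrative and this is not the TensorRT LLM implementation.

```python
import torch

def greedy_verify(target_logits, draft_tokens):
    """target_logits: (num_draft + 1, vocab) logits from one target forward over
    [last accepted token] + draft_tokens; draft_tokens: (num_draft,) proposed IDs."""
    target_choices = target_logits.argmax(dim=-1)   # target's preferred token at each position
    accepted = []
    for i, draft in enumerate(draft_tokens.tolist()):
        if draft == target_choices[i].item():
            accepted.append(draft)                  # draft agrees with the target: accept it
        else:
            break                                   # first mismatch: reject the rest
    # The "bonus" token comes from the target logits right after the last accepted token.
    bonus = target_choices[len(accepted)].item()
    return accepted, bonus
```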
### Two Challenges

When combining guided decoding and speculative decoding, two challenges arise.

First, at each generation iteration, speculative decoding proposes multiple draft tokens, some of which might be rejected in the verification step. The draft token proposal and rejection are not transparent to guided decoding. Specifically, this can be broken down into two views:

* For the target model, guided decoding should advance the grammar state and generate the mask for every draft token. If some draft tokens are rejected, guided decoding should roll back the grammar state to the last accepted token.
* For the draft model, without grammar constraints, some draft tokens may violate the grammar and thus be forcefully rejected in the verification step. Clearly, this hurts the acceptance rate. Hence, guided decoding should also intervene on the logits for every draft token generation if possible.
  * Some speculative algorithms propose draft tokens recurrently by computing logits and sampling (e.g., the standard draft-target model, EAGLE or MTP), similarly to a standard LLM. In that case, guided decoding can apply grammar constraints in a similar mask gen and mask applying manner.
  * Some drafting algorithms work without logits sampling, which requires other ways to apply the grammar constraints.

Second, specific to the one-model speculative decoding where a single CUDA graph contains multiple (draft and target) model forwards, the CPU-GPU synchronization becomes challenging. Note that for every step $i$, there are two event waits:

* The host waits for the *token event* that indicates the readiness of CPU tokens from step $i-1$.
* The model forward stream waits for the *mask event* that indicates the readiness of GPU masks from step $i$.

Figure 3: The CPU-GPU synchronization for multiple model forwards.

Note that in the two-model implementation, sampling is excluded from the CUDA graphs for better flexibility (Figure 2). From the CPU perspective, this leaves a window for the grammar computation. In particular, the mask event wait can be inserted between the CUDA graph replay and sampling, effectively making the GPU wait for the GPU masks asynchronously copied from the CPU. However, the CUDA graph of the one-model implementation contains multiple forwards, inevitably including the sampling operations. Hence, there is no such window for the grammar computation. The most prominent problem is that when replaying the CUDA graph, the mask event wait cannot be inserted before sampling. An alternative is capturing the events and waits in the CUDA graph, but this is still ineffective because the grammar computation runs on CPU and is thus not capturable. Once such a CUDA graph is launched for replay, the GPU does not wait for any newly recorded events, so it is impossible to block the GPU on the readiness of the masks.

## Trace Grammar State for Draft Token Proposal and Rejection

### Target Model

For a target model forward, a request should have one new token and multiple draft tokens, from the last verification step and the drafter, respectively. For each token in the sequence, guided decoding should advance the grammar state and fill the mask tensor. Before sampling, the masks should be applied to the corresponding logits. After verification, the grammar state should be rolled back by the number of rejected tokens. Compared to guided decoding with non-speculative decoding, the rollback operation is newly introduced. Thankfully, it is natively supported by grammar backends like [XGrammar](https://github.com/mlc-ai/xgrammar/blob/v0.1.21/python/xgrammar/matcher.py#L341-L350) and [LLGuidance](https://github.com/guidance-ai/llguidance/blob/v1.1.1/python/llguidance/_lib.pyi#L363-L366).

Before proceeding to the draft model view, note that the LLM can generate correct outputs as long as we apply grammar constraints on the target model, because any draft tokens violating the grammar will be forcefully rejected by the verification step. However, this hurts the acceptance rate.

### Draft Model

As mentioned above, for speculative algorithms based on recurrent logits sampling, we can apply grammar constraints to draft tokens in a similar mask gen and mask applying manner. Specifically, for the first drafting step, guided decoding advances the grammar state using the last new token. For the following drafting steps, the grammar state is advanced using the last draft token. Each step should fill and apply the mask to the corresponding draft model logits before sampling. After the drafting process, the grammar state should be rolled back to the original state, so that the subsequent target model forward starts from the correct grammar state.

If the draft and target models share the same vocabulary, the grammar computation is exactly the same, so the masks can be reused. One special case is EAGLE3, whose draft model has a [pruned vocabulary](https://github.com/SafeAILab/EAGLE/blob/58d1de099fe315645a82fe002e46586d54efe405/eagle/traineagle3/config.json#L22-L23) compared to the target model. For instance, LLaMA 3.1 has a 128k vocabulary, while the corresponding EAGLE3 drafter has a vocabulary containing the most frequent 32k tokens. This saves some computation in the lm_head GEMM. Note that the grammar is built on the target model's vocabulary, so the produced mask cannot be directly applied to the logits of the draft model.
EAGLE3 provides a special [d2t](https://github.com/SafeAILab/EAGLE/blob/d7161f9f94aaa345654d9b4045931145811d4d03/eagle/traineagle3/cnets.py#L673-L681) tensor that maps draft token IDs to target token IDs. [PR 7481](https://github.com/NVIDIA/TensorRT-LLM/pull/7481) fuses this d2t mapping into the mask applying kernel.

> **Note:** Here we focus on chain-based speculative algorithms. A tree-based algorithm would further complicate the implementation; in particular, guided decoding would have to traverse the drafting tree and advance and roll back grammar states accordingly.

## Make Grammar Computation Capturable by CUDA Graph

### CUDA Callback

CUDA graph helps eliminate CPU overhead and is an important technique in LLM inference systems, especially for the generation phase. As mentioned above, the one-model speculative decoding implementation launches a single CUDA graph to compute multiple draft and target model forwards. This makes the CPU-GPU synchronization challenging: the sampling operation depends on masks computed on CPU, but the GPU cannot wait for the readiness of any CPU computation once the CUDA graph is launched.

The CUDA callback [`cudaLaunchHostFunc`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g05841eaa5f90f27124241baafb3e856f) can launch a host function onto a CUDA stream. (The host function must not call any CUDA API.) This has two crucial implications:

* CUDA events and event waits can be inserted before and after the host functions, which can be used to synchronize the CPU and GPU computation.
* The host functions can be captured and replayed by CUDA graph.

Hence, we can launch the grammar computation, along with other auxiliary host functions, as CUDA callbacks onto a CUDA stream. The CUDA graph should capture and replay the multiple model forwards and the corresponding grammar computation all together. To achieve CPU-GPU overlapping, the grammar computation should be placed on a dedicated CUDA stream. Specifically, for every step $i$:

* The grammar stream:
  * waits for the *token event* that indicates the readiness of CPU tokens from step $i-1$;
  * performs grammar advance and mask gen (CUDA callback);
  * asynchronously copies the CPU masks to GPU;
  * records the *mask event*.
* The model forward stream:
  * computes the model forward using the last GPU tokens;
  * waits for the *mask event* that indicates the readiness of GPU masks;
  * applies the mask to the logits and then samples new tokens;
  * asynchronously copies the GPU tokens to CPU;
  * records the *token event*.

Figure 4: The CPU-GPU synchronization for multiple model forwards by CUDA callback.

### Integration to TensorRT LLM Python Runtime

We surveyed some off-the-shelf Python bindings of `cudaLaunchHostFunc`, but it turned out that they do not work well with CUDA graph (e.g., CUDA-Python [Issue 790](https://github.com/NVIDIA/cuda-python/issues/790), cupy [Issue 9274](https://github.com/cupy/cupy/issues/9274)). The probable reason is that the intermediate wrapper data structures are released once the callback is executed; hence, even though the callback is captured by CUDA graph, it cannot be replayed multiple times.

We implemented our own bindings to `cudaLaunchHostFunc` — [`launch_hostfunc`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L76). Specifically, `launch_hostfunc` packs the Python function and arguments into an [intermediate data structure](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L33) and calls `cudaLaunchHostFunc` to launch a [trampoline function](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L49) onto a CUDA stream. The trampoline function unpacks the intermediate data structure and invokes the Python function with the arguments. Note that `launch_hostfunc` offers great flexibility — it can launch an arbitrary Python function (without any CUDA API calls) as a CUDA callback. Hence, the grammar computation logic can still be implemented in Python.

When CUDA graph is capturing, `launch_hostfunc` does not release the intermediate data structure, so it remains accessible during CUDA graph replay. The intermediate data structures can be manually released via [`free_hostfunc_user_data`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L97); otherwise, they are automatically cleaned up when the Python interpreter exits. If CUDA graph is disabled (e.g., in the prefill phase), the intermediate data structure should be released promptly to avoid a memory leak. Specifically, the trampoline function automatically releases it once the callback finishes execution.

In Python, we provide a decorator `hostfunc` which casts an arbitrary Python function to a CUDA callback. For example, run the below code snippet:

```python
import torch

from tensorrt_llm._torch.hostfunc import hostfunc


@hostfunc
def increase(x: torch.Tensor):
    x.add_(1)


x = torch.zeros(10, dtype=torch.int32)
stream = torch.cuda.Stream()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=stream):
    increase(x)
    increase(x)

torch.cuda.synchronize()
with torch.cuda.stream(stream):
    for _ in range(10):
        g.replay()
torch.cuda.synchronize()

print(x)
```

The output would look like:

```txt
tensor([20, 20, 20, 20, 20, 20, 20, 20, 20, 20], dtype=torch.int32)
```

Note that the CUDA graph increments the tensor twice, and the graph is replayed ten times, so the tensor should be incremented by 20 in total. The output validates that the CUDA graph capture and replay are successful.

As the final step, we implemented a variant of `GuidedDecoder` — [`CapturableGuidedDecoder`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L405). It reuses most of the logic from `GuidedDecoder`, but the grammar computation and some auxiliary methods are decorated with `hostfunc`, making them capturable by CUDA graph.

### CUDA Graph Compatibility: Grammar Computation

Once captured, a CUDA graph can be launched to run the same GPU kernels as many times as needed.
Note that the replayed kernels are always executed on fixed input and output memory addresses; by filling the input buffers with new data, we can run the same work on new data. This pattern also applies to CUDA callbacks, except that the input and output buffers are on CPU. The guided decoder manages the buffers and resources below:

* [Request states](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L20): All the necessary request information affecting grammar computation, including the user-specified grammar, the last new token, and the draft tokens.
* [Grammar states](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L167-L168): The grammar states managed by the grammar backends. By leveraging the grammar backends, the guided decoder advances grammar states and fills mask tensors.
* [New tokens tensor](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L419-L422): The tensor values are copied from the newly computed GPU tokens, and are used to update the last new token or draft tokens of the request states.
* [Mask tensor](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L175-L177): The tensor values are filled according to the grammar states and then copied to the GPU masks, which are applied to the logits.

The buffers are stored at fixed memory locations, and the resources are accessed via fixed pointers. This makes the grammar computation compatible with CUDA graph. The buffers and resources are connected via slot IDs. In the runtime, each request is assigned an exclusive slot ID (0 <= slot ID < `max_batch_size`) upon its first scheduling. The slot ID stays occupied until the request is finished and removed from the scheduler.

When the runtime schedules a new batch of requests, the guided decoder updates the request states on the host. After that, all the other operations (grammar compilation/advance, mask gen, buffer copying, etc.) happen on CUDA streams and should be capturable by CUDA graph. More specifically, buffer copying should be asynchronous, and the other CPU computation should run as CUDA callbacks.

### CUDA Graph Compatibility: Mask Applying Kernel

The mask applying kernel takes a batch of logits and masks as input and modifies the logits in place. Specifically, the masked-out (disallowed by grammar) token logits are assigned negative infinity, so they can never be sampled as the next tokens.

Note that currently CUDA graph is enabled for the generation phase only, and the draft length is fixed for all requests. This greatly simplifies the effort for CUDA graph compatibility. Given `batch_size` and `max_num_draft_tokens`, the logits tensor has shape `(batch_size * (1 + max_num_draft_tokens), vocab_size)`. Hence, we can fill the first `batch_size * (1 + max_num_draft_tokens)` rows of the mask tensor accordingly and pass the mask tensor address to the kernel.

Some requests may have no grammar constraints. For such requests, we could fill the corresponding masks with all ones (allowed by grammar) so the logits are not modified by the kernel, but this causes unnecessary computation. To resolve this, a token-level mask tensor is introduced. Its values are set to zero for requests without grammar constraints, and the kernel skips the rows whose token-level mask value is zero.
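For illustration, the semantics of the mask applying kernel can be sketched in a few lines of PyTorch; the actual implementation is a fused, in-place CUDA kernel, and the tensor names here are illustrative.

```python
import torch

def apply_masks(logits, vocab_masks, token_level_mask):
    """PyTorch sketch of the mask applying semantics (not the real CUDA kernel).

    logits:           (batch_size * (1 + max_num_draft_tokens), vocab_size), modified in place
    vocab_masks:      same shape, bool, True = token allowed by the grammar
    token_level_mask: (batch_size * (1 + max_num_draft_tokens),), 0 = row has no grammar constraint
    """
    constrained = token_level_mask.bool()                        # rows that actually need masking
    logits[constrained] = logits[constrained].masked_fill(
        ~vocab_masks[constrained], float("-inf"))                # disallowed tokens can never be sampled
    return logits
```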
### Troubleshooting: Data Race between Host and CUDA Callback

Similar to GPU kernels, CUDA callbacks are executed asynchronously on CUDA streams. Note that both normal host functions and CUDA callbacks can access the same CPU memory addresses, which can easily cause a data race. In the initial implementation, `CapturableGuidedDecoder` directly read request states from [`ScheduledRequests`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/scheduler.py#L18). However, `ScheduledRequests` is shared throughout an executor iteration and is thus probably modified by other executor components. This creates a potential data race scenario:

* The guided decoder launches a CUDA callback, which will read some request states from `ScheduledRequests`;
* Some other executor component modifies `ScheduledRequests` in place;
* The CUDA callback is executed, reading the modified request states from `ScheduledRequests`.

Clearly, the CUDA callback may read unexpected data. This data race motivates a dedicated request states class — [`GuidedRequest`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L20). It is a request snapshot created for the guided decoder only, so it is never modified by other components. It is also possible that the guided decoder itself accesses request states via both normal host functions and CUDA callbacks, so we adopt a protocol that the request snapshots are created on the host and then accessed only via CUDA callbacks. This prevents potential data races within an executor iteration.

When the overlap scheduler is enabled, another data race scenario exists between executor iterations:

* Iteration $i$ launches CUDA callbacks, which will read request states from a fixed address;
* Iteration $i+1$ updates the request states;
* Iteration $i$'s CUDA callbacks are executed, reading request states updated by iteration $i+1$.

Again, the CUDA callbacks may read unexpected data. A straightforward solution is to let the request state update wait for the CUDA callback execution, but this effectively disables overlap scheduling. To resolve this issue and also unblock overlap scheduling, a [queue](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L417) is introduced. For each iteration, a new batch of request states is put into the queue; then, a CUDA callback is launched to fetch a new batch of request states from the queue, and all the subsequent CUDA callbacks access the newly fetched request states. This allows request snapshots of two (or even more) iterations to co-exist, which prevents potential data races between iterations.

### Troubleshooting: Deadlock by GIL and CUDA Mutex

After the first version was implemented, the program intermittently hung when `CapturableGuidedDecoder` was enabled. By inspecting the call stack, we found that it was hanging on seemingly unrelated kernel launches or other CUDA API calls. With further investigation, we discovered that the hang was caused by a deadlock between the Python GIL and a CUDA mutex. As documented, a CUDA callback must not make any CUDA API calls; this implies that CUDA callback execution and CUDA API calls compete for the same mutex. Note that the trampoline function needs to [acquire the GIL](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L52) before calling the Python code.
Hence, when a CUDA callback executes Python code, it acquires the CUDA mutex and then the GIL. Meanwhile, the Python main thread may hold the GIL and make CUDA API calls, so it acquires the GIL and then the CUDA mutex. The two threads acquire the two locks in opposite orders, which creates a deadlock pattern.

This deadlock can be resolved if the Python main thread releases the GIL for CUDA API calls. The TensorRT LLM Python runtime is built on PyTorch. Thankfully, PyTorch releases the GIL for most CUDA API calls, including PyTorch custom operators. However, we found two exceptions in PyTorch 2.8. First, when creating a device tensor using a shape that depends on data in another device tensor, PyTorch triggers an implicit, synchronized D2H copy, and this copy is executed with the GIL held ([Issue 163062](https://github.com/pytorch/pytorch/issues/163062)). This can be reproduced by the below code snippet:

```python
import torch

x = torch.randint(0, 100, (100,), dtype=torch.int64, device='cuda')
y = torch.zeros(100, x.max(), dtype=torch.int64, device='cuda')
```

The other case is that `torch.compile` kernels are called with the GIL held ([Issue 163061](https://github.com/pytorch/pytorch/issues/163061)), although Triton kernels are called with the GIL released. Hence, we have to avoid the problematic operators and disable `torch.compile` when using CUDA callbacks into Python code ([PR 7871](https://github.com/NVIDIA/TensorRT-LLM/pull/7871)), until these issues are fixed in PyTorch.

Another source of risk comes from runtime components that are implemented in C++ and exposed as Python bindings; they may make CUDA API calls as well. By default, Python bindings do not release the GIL. Hence, we swept these Python bindings and released the GIL properly ([PR 6948](https://github.com/NVIDIA/TensorRT-LLM/pull/6948)). After all these efforts, the hang issue disappeared. It is generally recommended to release the GIL when calling C++ code from Python; even outside the context of CUDA callbacks, this is beneficial for multi-threading performance. However, we acknowledge the limitation that it is difficult to make sure that every such place has been properly handled, and that future code changes do not introduce new risks.

> **Note:** Theoretically, GIL-free Python ([PEP 703](https://peps.python.org/pep-0703)) could be another remedy.

## Performance and Analysis

We benchmark the performance of guided decoding on two datasets, [JSON Mode Eval](https://huggingface.co/datasets/NousResearch/json-mode-eval) and [JSON Schema Bench](https://huggingface.co/datasets/epfl-dlab/JSONSchemaBench). The models are [LLaMA 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [LLaMA 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), the GPUs are H200, and the grammar backend is XGrammar.

Figure 5: Pareto curve on LLaMA 3.1 8B TP1 on H200, JSON Mode Eval. The concurrency ranges from 1 to 128.

Figure 6: Pareto curve on LLaMA 3.3 70B TP4 on H200, JSON Mode Eval. The concurrency ranges from 1 to 128.

Figures 5 and 6 present the Pareto curves on JSON Mode Eval for LLaMA 3.1 8B and LLaMA 3.3 70B, respectively. Speculative decoding achieves significant speedup for low-latency or throughput@latency scenarios. In particular, the speedup can be up to ~2x for batch size 1. The one-model EAGLE3 implementation is more performant than the two-model EAGLE3, and this performance gap is amplified for small models. This is reasonable, because the one-model implementation captures more workloads into a single CUDA graph, which results in less (if any) exposed CPU overhead. Note that although NGram is a two-model implementation, it performs surprisingly well. This is because JSON Mode Eval is an information extraction task. Each prompt contains the JSON schema and all the information required by the response, so the NGram has a high acceptance rate on this dataset.

Figure 7: Pareto curve on LLaMA 3.1 8B TP1 on H200, JSON Schema Bench. The concurrency ranges from 1 to 128.

Figure 8: Pareto curve on LLaMA 3.3 70B TP4 on H200, JSON Schema Bench. The concurrency ranges from 1 to 128.

Figures 7 and 8 show the results on JSON Schema Bench. The one-model EAGLE3 achieves the best performance across almost all scenarios. Note that the NGram becomes less performant since the task is no longer an information extraction task, although the JSON schemas are still present in the prompts.

| Dataset | Model | EAGLE3 | EAGLE3 w/o draft | NGram |
| :-----: | :---: | :----: | :--------------: | :---: |
| JSON Mode Eval | LLaMA 3.1 8B | 2.86 | 2.65 | 2.59 |
| JSON Mode Eval | LLaMA 3.3 70B | 2.72 | 2.60 | 2.44 |
| JSON Schema Bench | LLaMA 3.1 8B | 2.55 | 2.33 | 1.89 |
| JSON Schema Bench | LLaMA 3.3 70B | 2.50 | 2.30 | 1.87 |

Table 1: Average acceptance lengths per iteration for EAGLE3 and NGram. The acceptance length includes the golden token. The draft length is 3. "EAGLE3 w/o draft" means the draft model does not apply grammar constraints.

Table 1 lists the average acceptance lengths per iteration. We performed an ablation experiment in which the draft model does not apply grammar constraints. As presented, this does decrease acceptance rates, but by a smaller margin than expected. Note that applying grammar constraints on the draft model introduces extra overheads:

* In the drafting loop, the extra mask applying kernels slightly contribute to the GPU time.
* If the drafting forwards are too fast to hide the grammar computation, the exposed CPU time causes bubbles in the GPU timeline.

These extra overheads can partially offset the benefits from the improved acceptance.

## Acknowledgements

This work demonstrates an outstanding example of cross-team collaboration between the TensorRT LLM and XGrammar teams. We sincerely appreciate the support from everyone who contributed to making this happen. We acknowledge that it is built on top of the tremendous existing foundations from the community. In particular, some designs were inspired by vLLM [PR 14702](https://github.com/vllm-project/vllm/pull/14702) and SGLang [PR 6499](https://github.com/sgl-project/sglang/pull/6499). In addition, special thanks go to the authors who proposed speculative algorithms like EAGLE/MTP, and to the grammar backend projects like XGrammar/LLGuidance.

---

# Inference Time Compute Implementation in TensorRT LLM

By NVIDIA TensorRT LLM Team and UCSD Hao AI Lab

## Table of Contents

- [Inference-Time Compute Implementation in TensorRT LLM (Part 1: Design and Implementation)](#inference-time-compute-implementation-in-tensorrt-llm)
  - [Table of Contents](#table-of-content)
  - [Background and Motivation](#background-and-motivation)
  - [Introduction for Scaffolding: A Framework for inference-time compute](#introduction-for-scaffolding)
    - [Core Features](#scaffolding-core-feature)
    - [Architecture](#scaffolding-architecture)
      - [Worker](#scaffolding-architecture-worker)
      - [Controller](#scaffolding-architecture-controller)
      - [ScaffoldingLlm](#scaffolding-architecture-scaffoldingllm)
  - [An Example: Implement Dynasor on Scaffolding](#example-for-scaffolding)
    - [Introduction for Dynasor](#dynasor-introduction)
    - [Implement Dynasor-CoT in Scaffolding](#dynasor-cot-implement-in-scaffolding)
    - [Implement Dynasor-CoT based Majority Voting in Scaffolding](#dynasor-cot-based-majority-vote-in-scaffolding)
    - [Acknowledgements](#dynasor-acknowledgements)
    - [Reference](#dynasor-reference)
  - [Feature List on Scaffolding](#scaffolding-feature-list)
  - [Future Work](#scaffolding-future-work)

## Background and Motivation

Inference-time compute, also known as test-time scaling, is increasingly important. Beyond simply increasing output length, workflows such as best-of-N and Monte Carlo Tree Search (MCTS) offer additional capabilities for optimizing inference. Further, most agentic or multi-agent workflows are logically similar to these inference-time compute methods, except that they use more complex tools and context engineering.

However, conveniently defining these methods while achieving excellent inference performance has become a new problem: good performance requires careful asynchronous scheduling, but writing asynchronous scheduling programs is not easy for algorithm engineers. When external tools and token budget management come into play, the problem becomes even more complex.
LLM inference frameworks such as TensorRT LLM, vLLM, and SGLang provide high performance for inference of generation models or reward models, but they only handle single-request inference. Popular agent frameworks such as LangChain and Dify focus on enabling users to develop agents as simply as possible, but precisely because of this, they may have difficulty implementing inference-time compute methods that require precise definition and development. We therefore want to build a framework that supports users in exploring and deploying more inference-time compute methods. It should provide a modular infrastructure and fill the gap in balancing usability and performance for inference-time compute.

## Introduction for Scaffolding: A Framework for inference-time compute

`Scaffolding` is a high-performance framework for inference-time compute. It makes it easy for users to integrate various methods (CoT, majority vote, best of N, MCTS) and execution backends (TRTLLM / OpenAI API / tools), and also allows users to develop customized features such as token budgets.

### Core Features

The core features include:

* **Decoupling the inference-time compute method from the execution backend.** Scaffolding provides the `Controller` concept for users to define the method, the `Worker` concept to develop the execution backend, and `ScaffoldingLlm` to provide the API for users to integrate `Controller` and `Worker` and run the request.
* **Making inference-time compute methods modular and reusable.** An inference-time compute method can be composed of multiple modules. In Scaffolding, a `Controller` can be constructed from a series of `Sub-Controllers`, so users can flexibly assemble and replace the `Sub-Controllers`.
* **Providing sufficient concurrency to achieve good performance with ease of use.** Concurrency is the key to performance. `Scaffolding` provides three levels of concurrency: first, different requests to a `ScaffoldingLlm` instance can run concurrently; second, multiple `Sub-Controllers` can run concurrently; third, the multiple Tasks yielded from a `Controller` can run concurrently.

### Architecture

`Scaffolding` consists of three core components. Let's first briefly introduce these components. The `Worker` class is the backend that executes a single task, such as sending an inference request to an LLM inference framework or service, or completing a call to an external tool. The `Controller` class focuses on defining the workflow of an inference-time compute method. The `ScaffoldingLlm` class is responsible for integrating the two and completing the entire task.

This is the call sequence diagram of `Scaffolding`:

Figure 1. Scaffolding Sequence

Here we can focus on two points. First, `ScaffoldingLlm` provides users with the interface. Second, the `Controller` does not directly call the `Worker`. Next, we will introduce the code of the core components.

#### Worker

```python
class Worker(ABC):

    async def run_task(self, task: Task) -> TaskStatus:
        worker_cls = type(self)
        if type(task) not in worker_cls.task_handlers:
            return TaskStatus.WORKER_NOT_SUPPORTED
        return await worker_cls.task_handlers[type(task)](self, task)

    task_handlers = {}
```

The core interface of `Worker` is `run_task()`, which accepts a `Task`, executes it, and writes the result to the appropriate field. It should be noted that `run_task()` is an asynchronous function, so it can be called concurrently and asynchronously with Python asyncio.

#### Controller

```python
class Controller(ABC):

    def __init__(self):
        self.task_collections = {}

    def clone(self):
        return copy.deepcopy(self)

    def generate(self, prompt: str, **kwargs) -> GenerationResult:
        task = GenerationTask.create_from_prompt(prompt)
        yield from self.process([task], **kwargs)
        return task.create_scaffolding_output()

    def process(self, tasks: List[Task], **kwargs):
        raise NotImplementedError
```

Its two core interfaces are `generate()` and `process()`. `generate()` is the entry point for `ScaffoldingLlm` to invoke. In the default implementation of `generate()`, it produces a `Task` and then invokes `process()`. `process()` is the most important part of every `Controller` class, as it defines the workflow of the inference-time compute method. Let's go into a specific subclass of `Controller` to see how `process()` is implemented.

```python
class NativeGenerationController(Controller):

    class WorkerTag(Enum):
        GENERATION = "generation"

    def process(self, tasks: List[Task], **kwargs):
        for task in tasks:
            task.worker_tag = self.WorkerTag.GENERATION
            for key, value in self.sampling_params.items():
                if getattr(task, key) is None:
                    setattr(task, key, value)
            task.streaming = self.streaming

        yield tasks
```

Essentially, `process()` is a Python generator that yields lists of tasks. When the generator is resumed, that is, when the yield statement returns, the `Tasks` have been completed, meaning the result of each `Task` has been written into its result field. Then `process()` can proceed to the next steps. From here we can see that the implementation of a `Controller` can focus on the design of the workflow. It does not directly call the `Worker` and does not need to care about how these tasks are completed. That is how `Scaffolding` decouples the inference-time compute method from the execution backend.

Also, `Controller` makes the inference-time compute method modular and reusable. It only requires the `sub-Controller` to be a member of the class; the `process()` function of the `sub-Controller` is then called using the `yield from` statement.

```python
yield from self.reward_controller.process(generation_tasks, **reward_kwargs)
```

For concurrency with ease of use, `Controller` provides two mechanisms. As the code above shows, the yield statement yields a list of `Task`s, so the first mechanism is that the multiple Tasks in a single yield statement are executed in parallel. The second mechanism is for multiple `sub-Controllers` that can be executed in parallel; `Controller` provides syntactic sugar called `ParallelProcess`.
```python
generation_controllers = [
    self.generation_controller for _ in range(sample_num)
]
generation_kwargs_list = [generation_kwargs for _ in range(sample_num)]
generation_tasks = [copy.deepcopy(task) for _ in range(sample_num)]

yield ParallelProcess(generation_controllers,
                      [[t] for t in generation_tasks],
                      generation_kwargs_list)
```

#### ScaffoldingLlm

With `Controller` and `Worker` in place, we still need something to combine them: the `ScaffoldingLlm` class.

```python
llm_worker = TRTLLMWorker.init_with_new_llm(
    args.model_dir,
    backend="pytorch",
    max_batch_size=32,
    max_num_tokens=4096,
)
prototype_controller = NativeGenerationController(sampling_params={
    "temperature": 0.9,
    "max_tokens": 1024,
})
llm = ScaffoldingLlm(
    prototype_controller,
    {NativeGenerationController.WorkerTag.GENERATION: llm_worker},
)
results = llm.generate(prompts)
```

Users first create instances of `Worker` and `Controller` and map them by `WorkerTag` to create a `ScaffoldingLlm` instance, then call the `generate` interface of `ScaffoldingLlm` to get the final result. `ScaffoldingLlm` also provides an async interface:

```python
async for result in llm.generate_async(prompt):
    print(">>>", result.outputs[0].text)
```

Therefore, an instance of `ScaffoldingLlm` supports concurrent execution of multiple requests.

Let's summarize the overall implementation of `Scaffolding`. If users want to implement a new inference-time compute method, they can develop a new `Controller`; they can also call existing `Controllers` as its `sub-Controllers`. If users want to implement a new backend, they can either create a new `Worker` or add a new `Task` handler to an existing `Worker`. As for `ScaffoldingLlm`, we have hidden many complex implementation details, such as async scheduling, inside `ScaffoldingLlm`, so users do not need to modify its code.

## An Example: Implement Dynasor-CoT on Scaffolding

Dynasor-CoT is a certainty-based, training-free approach to accelerate Chain-of-Thought (CoT) inference. This section discusses how inference-time compute methods can be smoothly integrated into the TRT-LLM Scaffolding framework, using Dynasor-CoT as an example.

Figure 2. Demo of DeepSeek-R1-Distill-Qwen-7B achieving a 5.74x speedup compared to the baseline when using Dynasor-CoT on MATH500

### Introduction for Dynasor-CoT

#### Motivation of Dynasor-CoT

LLM reasoning is highly token-inefficient, often requiring far more tokens to achieve the same accuracy as non-reasoning models. A major source of this inefficiency is that reasoning models tend to **self-doubt**; they often reach the correct answer early but then engage in extended verification behaviors like double-checking and reassessment. For instance, Figure 2 compares a traditional Qwen-7B model with a reasoning-focused, Deepseek-distilled Qwen-7B model on a simple question. While the traditional model reaches its answer in 180 tokens, the reasoning model expends 1,000 tokens on iterative verification, despite having already found the correct answer at token 340. This represents a significant waste of tokens for diminishing returns on accuracy.

Figure 2. An example answer from a reasoning model (Deepseek-distilled Qwen-2.5 7B) vs. a traditional model (Qwen-2.5 7B) on one of the problems in the MATH500 dataset.

#### The "Probe" technique

Dynasor-CoT uses a **"Probe-In-The-Middle"** (or "probe" for short) technique, which prompts reasoning models to output early-stage results during intermediate steps of reasoning. Imagine you're in a math exam working on a hard problem: when time is up, you're forced to write down your final answer, regardless of how confident you are. A probe works the same way. More specifically, a probe is an extra generation request with an eliciting prompt appended to the intermediate reasoning tokens. One effective eliciting prompt is: `Oh, I suddenly got the answer to the whole problem, Final Answer: boxed{`.

Figure 3 shows an analysis comparing the accuracy of directly asking versus probing the model. Taking AMC23 as an example, reasoning models frequently arrive at correct answers early (median: 830 tokens) but continue generating unnecessary tokens due to self-doubt (median: 2.7K tokens).

Figure 3. DeepSeek-R1's performance on AMC23 and AIME24 at varying token budgets. (Left) Standard reasoning with late answer outputs. (Right) Early answer extraction using the Probe-In-The-Middle technique, demonstrating equivalent accuracy with a 50% token reduction. The greener regions in the right panels suggest the model knows the answers much earlier than it reveals in standard reasoning.

#### How it speeds up inference

Instead of generating a fixed number of tokens or waiting for a stop token, Dynasor-CoT **probes the model regularly** (e.g., every 32, 64, or 128 tokens) and **terminates the process** early once a consistent answer is formed across recent probes. This avoids unnecessary computation, directly reducing latency. Figure 4 provides an illustration:

* **Case 1**: All three probe requests yield the same answer, "3159.", indicating high certainty. The process can exit early.
* **Case 2**: Early-stage answers are inconsistent, indicating low confidence, so generation continues.
* **Case 3**: The model generates special tokens such as "wait" or "hmm," signaling hesitation; generation continues.

Figure 4. Illustration of Dynasor-CoT. Case 1: early exit due to consistent early-stage results. Case 2: continue generation due to inconsistent early-stage results. Case 3: responses containing hesitation words (e.g., wait) are discarded.

### Implement Dynasor-CoT in Scaffolding

A key difference between inference-time compute methods like Dynasor-CoT and a normal LLM generation request is that the generation process can consist of multiple smaller, user-defined tasks. The results of these tasks can dynamically control the overall logic—for example, by determining whether to expand the scope of subsequent generation or to terminate the process entirely. In a single Dynasor-CoT request, generation proceeds chunk by chunk, with additional "probe" tasks running in parallel with the main generation. Once a consistent answer is formed across recent probes, the process terminates early.

`Scaffolding` provides a good solution for customizing these kinds of data flows. Within a `Controller`, we can customize the data flow logic by defining how and when these smaller tasks are submitted. To implement Dynasor-CoT, we simply inherit from the base `Controller` class and override the `process()` function to customize how it yields tasks. We don't need to worry about how these tasks are executed because the inference-time compute methods and the execution backend are modularized and decoupled in Scaffolding. These tasks are submitted to `ScaffoldingLlm`, which then dispatches workers to complete them.

Let's start the implementation by inheriting the `Controller` class and adding the necessary parameters for Dynasor-CoT.

```python
class DynasorGenerationController(Controller):

    class WorkerTag(Enum):
        GENERATION = "generation_with_dynasor_cot"

    def __init__(
        self,
        generation_dir,
        max_tokens=8192,
        certainty_threshold=3,
        chunk_size=64,
        streaming=False,
    ):
        super().__init__()
        self.generation_dir = generation_dir
        self.max_tokens = max_tokens
        self.certainty_threshold = certainty_threshold
        self.chunk_size = chunk_size
        self.uncertain_words = ["wait", "hold", "but", "okay", "no", "hmm"]
        self.probe_suffix = "... Oh, I suddenly got the answer to the whole problem, **Final Answer**\n\n\\[ \\boxed{"
        self.answer_suffix = "\n\n... Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
        self.answer_suffix_with_marker = "\n\n...\n Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.generation_dir,
            legacy=False,
            padding_side='left',
            truncation_side='left',
            trust_remote_code=False,
            use_fast=True,
        )
        self.streaming = streaming
```

The `process()` function, as mentioned before, is the core method within the `Controller` class. Here, we can customize our data flow by specifying the logic for yielding tasks. For Dynasor-CoT, we have two different kinds of tasks:

1. `proposer_task`: Handles the main content generation, producing `self.chunk_size` tokens based on the previous content.
2. `probe_task`: Elicits an early-stage answer by generating 20 tokens from the same content.

The code below creates these two types of tasks.

```python
    def process(self, tasks: List[GenerationTask], **kwargs):
        # Start with the initial prompt provided by the first task.
        initial_prompt = tasks[0].input_str

        proposer_task = GenerationTask()
        proposer_task.max_tokens = self.chunk_size
        proposer_task.temperature = 0.6
        proposer_task.top_p = 0.95
        proposer_task.worker_tag = self.WorkerTag.GENERATION

        probe_task = GenerationTask()
        probe_task.max_tokens = 20
        probe_task.temperature = 0.6
        probe_task.top_p = 0.95
        probe_task.worker_tag = self.WorkerTag.GENERATION

        probe_answers = []
        probe_responses = []

        initial_prompt_token_num = len(
            self.tokenizer.encode(initial_prompt, add_special_tokens=False))
        probe_suffix_token_num = len(
            self.tokenizer.encode(self.probe_suffix, add_special_tokens=False))

        current_prompt = initial_prompt
```

To prevent extra latency, the `proposer_task` should not be blocked by the `probe_task`. Scaffolding's task-level concurrency handles this perfectly: we can yield `proposer_task` and `probe_task` in a single list, and multiple tasks yielded together in the same list will be batched and executed in parallel.

```python
yield [proposer_task, probe_task]
```

In the following `for` loop, each iteration performs these steps:

1. **Submit** both a proposer task and a probe task by yielding them. We don't need to worry about execution details, as they are handled by `ScaffoldingLlm`, which binds the `Controller` and `Workers` together behind the scenes.
2. **Evaluate** the probe response after the tasks return, checking for consistency over several rounds (using `certainty_threshold`).
3. **Finalize** the answer and return if it is consistent. Otherwise, append the new tokens from the proposer task and proceed to the next iteration.

```python
        # Iterate over generation rounds until the maximum tokens limit is reached.
        for _ in range(initial_prompt_token_num + probe_suffix_token_num,
                       self.max_tokens, self.chunk_size):
            proposer_task.input_str = current_prompt
            # For the probe task, append the suffix to force a chain-of-thought leading to an answer.
            probe_task.input_str = current_prompt + self.probe_suffix

            yield [proposer_task, probe_task]

            # Retrieve the output from the probe task.
            probe_text = probe_task.output_str

            # Extract the potential answer from the probe response.
            answer = self.obtain_answer(probe_text)
            probe_answers.append(answer)
            probe_responses.append(probe_text)

            if self.should_early_stop(probe_answers, probe_responses):
                tasks[0].result = probe_task.result
                # If the current prompt indicates the chain-of-thought phase has ended
                # (the end-of-thinking marker is present), use one type of suffix.
                if "</think>" in current_prompt:
                    tasks[0].output_str = (current_prompt + self.answer_suffix +
                                           probe_answers[-1] + "}\n\\]")
                    return
                else:
                    # Otherwise, use the suffix with marker to transition clearly.
                    tasks[0].output_str = (current_prompt +
                                           self.answer_suffix_with_marker +
                                           probe_answers[-1] + "}\n\\]")
                    return

            # If the answer is not deemed confident, perform another round of generation.
            # Append the newly generated text from the proposer to the current prompt
            # for the next iteration.
            current_prompt += proposer_task.output_str

        # If the maximum token limit is reached without satisfying the certainty condition,
        # output the accumulated prompt as the final output.
        tasks[0].result = proposer_task.result
        tasks[0].output_str = current_prompt
        return
```

The `probe_task` can utilize prefix KV cache reuse to enhance inference performance. TensorRT LLM allows the KV cache of an in-progress request to be reused by other requests, so `probe_task` can reuse `proposer_task`'s KV cache even though `proposer_task` is still running. Now we have implemented a `Controller` for Dynasor-CoT.
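The helpers `obtain_answer()` and `should_early_stop()` used above are not shown here. A possible sketch of their logic, as methods of `DynasorGenerationController`, would be answer extraction plus a consistency check over the last `certainty_threshold` probes, treating hesitation words as uncertainty; this is an illustrative approximation, not necessarily the exact implementation.

```python
    @staticmethod
    def obtain_answer(probe_text: str) -> str:
        # The probe prompt ends with "\\boxed{", so the answer is everything before the
        # closing brace; an empty string means the probe produced no usable answer.
        return probe_text.split("}")[0].strip() if "}" in probe_text else ""

    def should_early_stop(self, probe_answers, probe_responses) -> bool:
        # Not enough probes yet to judge consistency.
        if len(probe_answers) < self.certainty_threshold:
            return False
        recent_answers = probe_answers[-self.certainty_threshold:]
        recent_responses = probe_responses[-self.certainty_threshold:]
        # Require a non-empty answer that is identical across the recent probes ...
        if recent_answers[0] == "" or len(set(recent_answers)) != 1:
            return False
        # ... and no hesitation words (e.g. "wait", "hmm") in the recent probe responses.
        return not any(word in response.lower()
                       for response in recent_responses
                       for word in self.uncertain_words)
```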
Here is an example of how to use it:

```python
dynasor_generation_controller = DynasorGenerationController(
    # Parameters for DynasorGenerationController
)
llm = ScaffoldingLlm(
    prototype_controller=dynasor_generation_controller,
    # other parameters for ScaffoldingLlm
)
results = llm.generate(prompts)
```

### Implement Dynasor-CoT based Majority Voting in Scaffolding

Scaffolding is designed to be modular and reusable. We can assemble methods just like LEGO building blocks. For instance, to implement Dynasor-CoT-based Majority Voting, we can simply stack our `DynasorGenerationController` with a `MajorityVoteController`. Once a controller for majority voting is built, no further implementation is needed. We can directly stack the two controllers as shown below.

```python
dynasor_generation_controller = DynasorGenerationController(
    # Parameters for DynasorGenerationController
)
majority_vote_controller = MajorityVoteController(
    generation_controller=dynasor_generation_controller,  # stack here
    # Other parameters for MajorityVoteController
)
llm = ScaffoldingLlm(
    prototype_controller=majority_vote_controller,  # Expose the outermost controller to ScaffoldingLlm
    # other parameters for ScaffoldingLlm
)
results = llm.generate(prompts)
```

### Acknowledgements

This work demonstrates an outstanding example of cross-team collaboration between the TensorRT LLM team and the UCSD Hao AI Lab. We sincerely appreciate the support from everyone who contributed to making this happen.

### Reference

[1] Y. Fu*, J. Chen*, Y. Zhuang, Z. Fu, I. Stoica, and H. Zhang, "Dynasor: More Efficient Chain-of-Thought Through Certainty Probing," Hao-AI-Lab Blog, Feb. 16, 2025. [Online]. Available: https://hao-ai-lab.github.io/blogs/dynasor-cot/

## Feature List on Scaffolding

You can customize your own `Controller`, `Worker`, and `Task`; however, we have provided a foundational set with commonly used functionality that you can use directly:

* `Worker`: TensorRT LLM, OpenaiAPI, MCP
* `Task`: Generation, Reward, ToolCall
* `Controller`: MajorityVote, PRMReward, BestOfN, MCTS

## Future Work

The future work is divided into two parts. The first part is to enable `Scaffolding` to support more inference-time compute methods, especially methods for agentic and multi-agent systems. The second part is that we hope to find more opportunities to optimize TensorRT LLM based on `Scaffolding` workloads. For example, in terms of KV cache prefix reuse, `Scaffolding` can identify which parts are system prompts, which parts are likely to be reused in subsequent requests of the agent task, and which parts cannot be reused and can be evicted immediately.

Finally, we want to emphasize that we welcome and look forward to more people joining our open source community. You can find related issues in the [TensorRT LLM GitHub issues with the Scaffolding tag](https://github.com/NVIDIA/TensorRT-LLM/issues?q=state%3Aopen%20label%3AScaffolding).
---

# Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)

This blog post is a continuation of previous posts:

* [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
* [Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)

In this blog post, we focus on performance optimization, diving deeper into techniques such as lower precision, network structure refactoring, and aggressive kernel fusion. We hope this analysis and optimization process brings new inspiration to your model inference optimization work.

*By NVIDIA TensorRT LLM Team*

## Table of Contents

- [Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)](#scaling-expert-parallelism-in-tensorrt-llm-part-3-pushing-the-performance-boundary)
  - [Table of Contents](#table-of-contents)
  - [Overview](#overview)
  - [Lower precision](#lower-precision)
    - [wo GEMM FP4 quantization](#wo-gemm-fp4-quantization)
    - [Low precision `AlltoAll`](#low-precision-alltoall)
    - [FP8 context FMHA support](#fp8-context-fmha-support)
  - [Rethink network structure](#rethink-network-structure)
    - [MTP LM head tensor parallelism](#mtp-lm-head-tensor-parallelism)
    - [Context phase Q/K/V `concat` optimization](#context-phase-qkv-concat-optimization)
  - [More kernel overlap, fusion and optimization](#more-kernel-overlap-fusion-and-optimization)
    - [Overlap kernels using programmatic dependent launch (PDL)](#overlap-kernels-using-programmatic-dependent-launch-pdl)
    - [Fuse several `AlltoAll` kernels](#fuse-several-alltoall-kernels)
    - [Fuse `add` (sparse exp and shared exp) into local reduction](#fuse-add-sparse-exp-and-shared-exp-into-local-reduction)
    - [Optimize PyTorch native `copy` and `concat` using `torch.compile`](#optimize-pytorch-native-copy-and-concat-using-torchcompile)
  - [End-to-End Performance](#end-to-end-performance)
  - [Acknowledgements](#acknowledgements)

## Overview

Let's first take a look at the network structure before the optimizations, to give an overall view of how the workloads look:

Figure 1: Network structure overview before optimization

In this third blog of our scaling Expert Parallelism (EP) series, we push the performance boundaries of large-scale EP on NVIDIA GB200 NVL72 through multiple optimization techniques. Building upon the foundation established in [part 1](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) and [part 2](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md), this blog explores three key optimization pillars: **lower precision computation** (including FP4 quantization for wo GEMM, low-precision AlltoAll communication, and FP8 context FMHA), **network structure rethinking** (featuring MTP LM head tensor parallelism and context phase Q/K/V concatenation elimination), and **aggressive kernel fusion and overlap** (leveraging Programmatic Dependent Launch, fused AlltoAll operations, and torch.compile optimizations). These optimizations collectively deliver significant end-to-end performance improvements for wide-EP scenarios on NVIDIA GB200 NVL72, for DeepSeek R1 with its specialized Multi-head Latent Attention (MLA) mechanism. Each technique is carefully designed to maintain accuracy while maximizing performance, demonstrating the power of combining algorithmic innovation with deep hardware awareness.

## Lower precision

### wo GEMM FP4 quantization

The wo GEMM is the final linear layer within the multi-head attention block that produces the final outputs. While DeepSeek R1's MLA modifies the initial projections for keys and values, the wo GEMM operator remains a critical and standard component for finalizing the attention computation. In this term, "wo" is an abbreviation for the weight matrix of the output projection.

We've evaluated that quantizing the wo GEMM to FP4 still satisfies the accuracy requirements, maintaining a similar MTP accept rate (AR) while improving end-to-end performance. The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) team has published checkpoints that additionally quantize the wo module in attention layers to FP4 on HuggingFace:

* https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2
* https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4-v2

In TensorRT LLM, this is supported by [PR 6393](https://github.com/NVIDIA/TensorRT-LLM/pull/6393). To utilize the checkpoints, simply use the LLM API or `trtllm-serve` to load them. Refer to [deploy-with-tensorrt-llm](https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2#deploy-with-tensorrt-llm) for more details.

### Low precision `AlltoAll`

In wide-EP MoE, the combine phase (after experts finish FC2) performs an all-to-all to return each token's expert outputs to its origin rank, followed by a per-token reduce over the top-k experts. This step is typically bandwidth-bound when the FC2 outputs are in BF16 or FP16. We introduce a low-precision AlltoAll that transmits these combine payloads in NVFP4 instead of BF16/FP16, then dequantizes them back on the receiver before the local reduction. During combine, we temporarily quantize the per-token expert outputs to NVFP4 (e2m1 values with per-16-element E4M3 scale factors plus a global scale) inside shared memory, send the compact representation across GPUs, and dequantize back to the original dtype on the receiving side. Indices and routing-related small tensors remain in their native types.
Since we quantize only for transport and the outputs are dequantized back to the working dtype before the per-token reduction, we observe negligible accuracy impact; tolerances comparable to a quant-dequant roundtrip are sufficient. This feature is supported by [PR 7155](https://github.com/NVIDIA/TensorRT-LLM/pull/7155) and [PR 7898](https://github.com/NVIDIA/TensorRT-LLM/pull/7898).

### FP8 context FMHA support

FP8 context FMHA is a technique that uses the FP8 data format to accelerate the FMHA/MLA computation during the context phase of a model. It is designed to improve TTFT and prefill throughput, particularly when processing long contexts, without significantly sacrificing accuracy.

In the context phase, K and V can be stored in FP8 format, which is often referred to as FP8 KV cache. Using FP8 KV cache can significantly save GPU memory, which is especially beneficial for long input sequences. However, since Q is in BF16 format, FMHA would also be performed in BF16, which cannot benefit from the FP8 Tensor Core. With FP8 context FMHA, we first quantize Q into FP8 format, aligning it with the FP8 K and V, and then leverage the FP8 Tensor Core for FMHA/MLA. Since the context phase is compute-bound and the Tensor Core has much higher FP8 FLOPS than BF16 FLOPS, the speed-up becomes more pronounced as the input sequence length grows. Since FP8 context FMHA maintains accuracy very close to the BF16 baseline, we enable it automatically when users use FP8 KV cache on Hopper or Blackwell. This is supported by [PR 7610](https://github.com/NVIDIA/TensorRT-LLM/pull/7610) and [PR 7612](https://github.com/NVIDIA/TensorRT-LLM/pull/7612).

## Rethink network structure

### MTP LM head tensor parallelism

The LM (language modeling) head is responsible for converting the `hidden_states` computed by the previous decoder layers to `logits`. It's a linear layer with weights of shape `(vocab_size, hidden_size)`, outputting logits of shape `(batch_size, seqlen, vocab_size)`. We are primarily interested in the logits corresponding to the last token of the input sequence, so the logits finally have shape `(batch_size, vocab_size)`.

When MTP is enabled, the number of tokens that the MTP layers handle equals the batch size, while the main model handles `(1 + MTP) * batch_size` tokens. This makes the LM head computation in the MTP layers more likely to fall into the memory-bound range; empirically, 256 tokens is the boundary between memory-bound and math-bound. This leads to an optimization idea: if we keep the calculation memory-bound but reduce the size of the weights that need to be loaded, there could be performance benefits.

Based on this analysis, we conducted experiments on the following scenario: a DeepSeek R1 EP32 case with attention DP and MTP-3 enabled, where the local per-rank batch size is 32. Before the optimization, there is 32-way data parallelism, so each MTP module on each rank processes 32 tokens for the LM head calculation.

Figure 2: MTP LM head computation before optimization

In the optimization, we first perform an `AllGather` across every 4 GPUs, so that each GB200 node has all tokens prepared for the following TP4 calculation. Then, we split the LM head weights along the vocab dimension across those 4 GPUs and perform 4-way TP. Afterwards, each TP rank computes the argmax over its local vocabulary shard, a second `AllGather` collects these local results, and the global argmax is taken across all TP ranks. Computing the local argmax first minimizes both communication volume and argmax computation overhead. Finally, we split the results back to the original data-parallel layout to guarantee correctness.
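Figure 3 below shows the optimized flow. The following is a rough PyTorch-style sketch of the local-argmax pattern; the process group, tensor names, and shapes are illustrative and not the actual implementation.

```python
import torch
import torch.distributed as dist

def mtp_lm_head_tp(hidden, lm_head_shard, tp_group, vocab_shard_size):
    """Illustrative 4-way TP LM head with local argmax.

    hidden:        [local_batch, hidden_size] tokens owned by this rank
    lm_head_shard: [vocab_shard_size, hidden_size] vocab shard on this rank
    """
    # 1) Gather tokens from the ranks in the TP group.
    gathered = [torch.empty_like(hidden) for _ in range(tp_group.size())]
    dist.all_gather(gathered, hidden, group=tp_group)
    all_tokens = torch.cat(gathered, dim=0)             # [tp * local_batch, hidden]

    # 2) Each rank computes logits only for its vocab shard.
    local_logits = all_tokens @ lm_head_shard.T         # [tp * local_batch, vocab_shard]

    # 3) Local argmax first, then exchange only the small (value, index) pairs.
    local_max, local_idx = local_logits.max(dim=-1)
    local_idx += dist.get_rank(group=tp_group) * vocab_shard_size
    maxes = [torch.empty_like(local_max) for _ in range(tp_group.size())]
    idxs = [torch.empty_like(local_idx) for _ in range(tp_group.size())]
    dist.all_gather(maxes, local_max, group=tp_group)
    dist.all_gather(idxs, local_idx, group=tp_group)

    # 4) Global argmax across vocab shards, then keep this rank's own tokens.
    winner = torch.stack(maxes).argmax(dim=0)            # winning shard per token
    token_ids = torch.stack(idxs).gather(0, winner.unsqueeze(0)).squeeze(0)
    rank = dist.get_rank(group=tp_group)
    return token_ids[rank * hidden.shape[0]:(rank + 1) * hidden.shape[0]]
```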

Figure 3: MTP LM head computation after applying tensor parallelism

*Some layers are omitted in the diagrams above to keep the example simple.*

Note that we can expand the TP to 8-way to utilize multi-node NVLink, as long as we still gain performance from reducing weight-loading time in memory-bound scenarios. This feature is supported by [PR 7571](https://github.com/NVIDIA/TensorRT-LLM/pull/7571) and [PR 7891](https://github.com/NVIDIA/TensorRT-LLM/pull/7891).

### Context phase Q/K/V `concat` optimization

In the standard attention mechanism, Q/K/V are derived from the same hidden states through `GEMM_Q`/`GEMM_K`/`GEMM_V` operations, and TensorRT LLM typically merges the weights of these three GEMMs in advance, executing a single `GEMM_QKV` to obtain a large contiguous QKV tensor, which is then used as the input to the attention kernels.

However, DeepSeek's MLA is a special attention module where Q/K/V are obtained by applying different downsampling-upsampling processes to the hidden states. Additionally, Q and K are divided into two parts, with RoPE and without RoPE, so a contiguous QKV tensor cannot be obtained directly. In the initial implementation of context MLA, due to input format constraints of the attention kernels, TensorRT LLM had to explicitly concatenate the Q/K/V tensors into one contiguous QKV tensor, resulting in extra memory and time overhead that became more significant in wide-EP scenarios.

Recently, we introduced a new input format for the context MLA kernels called "separate qkv". As the name implies, these attention kernels now support three separate Q/K/V tensors as direct inputs. [PR 6538](https://github.com/NVIDIA/TensorRT-LLM/pull/6538) refactors the MLA process to eliminate the need for concatenating Q/K/V, saving copy operations and significantly improving prefill latency in wide-EP scenarios.

## More kernel overlap, fusion and optimization

The team has implemented aggressive kernel fusion, overlap, and optimization to reduce kernel launch overheads and overall kernel duration. This includes:

- overlapping kernels using PDL,
- fusing several `AlltoAll` kernels through refactoring,
- fusing the sparse-expert and shared-expert `add` into the local reduction,
- fusing `memset` into `expandinputrow`,
- fusing `finalizeMoeRouting` into FC2, and
- removing the `swizzle` kernel after `AlltoAll`.

The following three representative examples demonstrate the common ideas behind these optimizations.

### Overlap kernels using programmatic dependent launch (PDL)

The Programmatic Dependent Launch (PDL) mechanism allows a dependent secondary kernel to launch before the primary kernel it depends on in the same CUDA stream has finished executing. Refer to the [official documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization) for more details. TensorRT LLM has been using this feature to optimize end-to-end performance, and we have now brought it to the kernels used by the wide-EP workflow as well. The implementation is in [PR 7977](https://github.com/NVIDIA/TensorRT-LLM/pull/7977). We insert a call to the `cudaTriggerProgrammaticLaunchCompletion` API in all thread blocks of the primary kernel, which signals that the secondary kernel is ready to launch, and call the `cudaGridDependencySynchronize` API in the secondary kernel, which blocks until all primary kernels the secondary kernel depends on have completed and flushed their results to global memory.
The following example from the [official documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#api-description) demonstrates how PDL is supported in TensorRT LLM. The only difference is that we insert both `cudaTriggerProgrammaticLaunchCompletion` and `cudaGridDependencySynchronize` into the same kernel so that it can overlap with both the preceding and the subsequent kernels.

```c
__global__ void primary_kernel()
{
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Independent work

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}
```

We have verified accuracy after the modification to ensure that computation results are not affected by incorrect memory reads and writes. With that guarantee in place, we overlap these kernels as much as possible for performance. In TensorRT LLM, PDL can be enabled by setting the environment variable `TRTLLM_ENABLE_PDL` to `1`; we may expose this as an official API in the future.

The effect of enabling PDL can be clearly observed using [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems). Taking the `moeComputeRouteKernel`, `computeCountAndIndiceDevice`, and `computeCumsumDevice` kernels as an example, they execute strictly in order when PDL is disabled:

Figure 4: The profiling results of disabling PDL.

The following profiling results show how the three kernels overlap after enabling PDL.

Figure 5: The profiling results of enabling PDL.

*The above profiles were generated using commit [84d2f12](https://github.com/NVIDIA/TensorRT-LLM/tree/84d2f1281857fbb1662b14603d3123cf327ac94f) on the main branch. They may change in future versions.*

For tips on using Nsight Systems to profile and analyze TensorRT LLM performance, refer to [Coordinating with NVIDIA Nsight Systems Launch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-analysis.md#coordinating-with-nvidia-nsight-systems-launch).

### Fuse several `AlltoAll` kernels

We redesigned and reimplemented `AlltoAll` to better support communication fusion (covering the `hiddenStates` during dispatch, the low-precision scaling factors, and MoE's `tokenSelectedExpert` and scales), to support low-precision communication during dispatch, and to handle potential alignment issues in the original data. Taking the dispatch of four fields as an example, the data flow is shown in Figure 6.

Figure 6: The data flow of the new AlltoAll kernel

The sending process is as follows:

- The first step loads the original data from global memory according to its alignment, using TMA to load it into shared memory as `unAlignedData`.
- Next, in shared memory, all fields are aligned to 16-byte boundaries and the different fields are concatenated together to form `alignedData`.
- If low-precision communication is needed, the aligned data is quantized into low-precision `lowPrecisionData`. Currently, quantization is only supported for a single field.
- Next, the data is encoded according to the protocol. For example, with LL128, each 128 bytes contains 120 bytes of valid data and 8 bytes of flags. To avoid bank conflicts during encoding in shared memory, we select different flag positions for different packets, and the final encoded data is stored in `protoPackedData+Flag`.
- Finally, the proto-encoded `protoPackedData+Flag` is written to the remote GPU's workspace.

For the receiver, it only needs to check the flag at the corresponding position in the workspace to confirm whether the data is ready. If ready, the original data is decoded in the reverse order of sending and written to the corresponding tensors.

Through this approach, we can send and receive multiple arbitrarily aligned fields in a fused manner and support low-precision communication during the combine process. This feature was implemented in [PR 6973](https://github.com/NVIDIA/TensorRT-LLM/pull/6973).

### Fuse `add` (sparse exp and shared exp) into local reduction

To reduce the number of kernel launches and achieve better overlap at the tail of the MoE module, we've fused the shared-expert `add` into the local reduction kernel that aggregates the top-k experts. This removes the extra add operator without increasing the reduce operator's overhead, and it writes the output only once, lowering bandwidth usage. The optimization is compatible with NVFP4 combine, requires no API changes, and brings no accuracy impact. It was added by [PR 7422](https://github.com/NVIDIA/TensorRT-LLM/pull/7422).

### Optimize PyTorch native `copy` and `concat` using `torch.compile`

We have observed several inefficient `copy` and `concat` operations in the context phase in wide-EP scenarios, and one significant case is copying `k_nope` in the MLA module. As mentioned in the previous section, Q and K are divided into two parts in DeepSeek MLA: with RoPE and without RoPE. In the context phase, the nope part has a head size of 128 and the rope part has a head size of 64, adding up to a head size of 192. However, the FMHA kernel reads Q and K with head size 192 directly, which means we have to prepare the full Q and K tensors using `copy` and `concat`.

In an ISL/OSL 8k/1k, batch size 1 case, we observed that this copy takes 306 us in the context phase, which is clearly suboptimal. A rough theoretical estimate, assuming 8 TB/sec of HBM3e bandwidth, is:

```
( ISL 8192 * k_nope_size 128 * num_heads 128 * 2 bytes * read/write 2 ) / ( 8 TB/sec * efficiency 0.8 ) = 80 us
```

To optimize the operator, we simply applied the `torch.compile` decorator to it, and the kernel duration drops to 107 us, which is much closer to the theoretical estimate. [PR 8044](https://github.com/NVIDIA/TensorRT-LLM/pull/8044) implemented the changes. This is an outstanding example of the power of `torch.compile`, showing how an operator can be analyzed and optimized without hand-crafting kernels.
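As a rough illustration of the approach (the function and shapes below are hypothetical rather than the actual TensorRT LLM code), wrapping the concatenation in a `torch.compile`-decorated function lets the compiler generate a single fused copy kernel:

```python
import torch

@torch.compile  # let the compiler fuse the copy/concat into one generated kernel
def build_full_k(k_nope: torch.Tensor, k_rope: torch.Tensor) -> torch.Tensor:
    # k_nope: [num_tokens, num_heads, 128], k_rope: [num_tokens, num_heads, 64]
    # Returns K with head size 192, as expected by the context FMHA kernel.
    return torch.cat([k_nope, k_rope], dim=-1)

# Illustrative shapes for the ISL 8k, batch size 1 case discussed above.
k_nope = torch.randn(8192, 128, 128, dtype=torch.bfloat16, device="cuda")
k_rope = torch.randn(8192, 128, 64, dtype=torch.bfloat16, device="cuda")
k = build_full_k(k_nope, k_rope)   # [8192, 128, 192]
```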
## End-to-End Performance

After applying the optimizations above, the network structure is cleaner. For example, `o_proj` and the `A2A tokens` now compute in lower precision, and operators like the sparse-expert and shared-expert `add` are now fused into the `reduction`. The optimized parts are marked in **bold**.

Figure 7: Network structure overview after optimization

We measured performance with these optimizations and compared it against the baseline (the main branch as of July). The optimizations described above deliver a significant end-to-end improvement.

Figure 8: End-to-End Performance on Aug 31st

*Note: The numbers were collected on August 31st. Some optimizations mentioned above were not yet added at that time.*

To review how wide EP helps with Blackwell's leading inference benchmarks, also read these recent blog posts:

* [NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX™ v1 Benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
* [NVIDIA Blackwell Raises Bar in New InferenceMAX Benchmarks, Delivering Unmatched Performance and Efficiency](https://blogs.nvidia.com/blog/blackwell-inferencemax-benchmark-results/)

## Acknowledgements

This is a great continuation of previous work on TensorRT-LLM wide EP and another demonstration of excellent teamwork. It stems from brilliant performance optimization ideas, solid performance analysis and benchmarking, and rapid engineering support and implementation. By sharing these experiences, we hope to help more people who are interested in deploying large-scale LLM models on NVIDIA GPUs to run AI faster.

---

# Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs

by NVIDIA TensorRT LLM team

## Table of Contents

- [Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs](#pushing-latency-boundaries-optimizing-deepseek-r1-performance-on-nvidia-b200-gpus)
  - [Table of Contents](#table-of-contents)
  - [Background](#background)
  - [Implementation Configuration](#implementation-configuration)
    - [Workload Profile](#workload-profile)
    - [Model Architecture](#model-architecture)
    - [Precision Strategy](#precision-strategy)
    - [Parallelism Strategy](#parallelism-strategy)
    - [Everything in One Diagram](#everything-in-one-diagram)
  - [Key Optimizations](#key-optimizations)
    - [System Level optimizations](#system-level-optimizations)
      - [CUDA Graph \& Programmatic Dependent Launch](#cuda-graph--programmatic-dependent-launch)
      - [MTP](#mtp)
        - [Autoregressive MTP Layers](#autoregressive-mtp-layers)
        - [Relax Acceptance Verification](#relax-acceptance-verification)
      - [Multi-streams](#multi-streams)
      - [Sparse Experts as GEMMs (only works when moe\_backend=CUTLASS)](#sparse-experts-as-gemms-only-works-when-moe_backendcutlass)
      - [Re-balanced the sparse experts](#re-balanced-the-sparse-experts)
        - [Mixed ETP](#mixed-etp)
        - [Smart Router](#smart-router)
    - [Kernel Level optimizations](#kernel-level-optimizations)
      - [Attention Kernel](#attention-kernel)
      - [Grouped GEMM](#grouped-gemm)
        - [CUTLASS Backend (default backend)](#cutlass-backend-default-backend)
        - [TRTLLM Backend](#trtllm-backend)
      - [Communication Kernel](#communication-kernel)
      - [Dense GEMM optimization](#dense-gemm-optimization)
        - [Fuse\_A\_GEMM](#fuse_a_gemm)
        - [RouterGEMM](#routergemm)
      - [Kernel fusion](#kernel-fusion)
  - [How to reproduce](#how-to-reproduce)
  - [Future Works](#future-works)
  - [Acknowledgment](#acknowledgment)

## Background

Recent advancements in Large Language Reasoning Models have demonstrated remarkable success, while creating new deployment challenges. A critical challenge emerges from extended Output Sequence Lengths (OSL) due to complex "thinking and reasoning" processes. Longer OSL demands stricter Token-to-Token Latency (TTL) requirements, often forcing concurrency limitations. The most extreme case, single concurrency (the min-latency scenario), becomes particularly challenging for real-time applications.
This article explores how TensorRT LLM achieves record-breaking performance for [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) in min-latency scenarios on NVIDIA's 8×B200 GPU configuration, progressing from 67 tokens per second (TPS) to 253 TPS before GTC 2025 (a **3.7x** speed-up), and now to 368 TPS (a **5.5x** speed-up).

## Implementation Configuration

### Workload Profile

- Input Sequence Length (ISL): 1k tokens
- Output Sequence Length (OSL): 2k tokens

### Model Architecture

The base DeepSeek-R1 main model contains 3x dense layers (initial) and 58x MoE layers; there is also 1x Multi-Token Prediction (MTP) layer (architecturally equivalent to an MoE layer) for speculative decoding. Our optimized configuration extends the MTP layer to 3x layers used autoregressively for peak performance exploration.

tech_blog1_model_overview

### Precision Strategy

We have explored a mixed precision recipe, which provides a better tradeoff between accuracy and performance.

| Component | Precision |
|:-------------------------------------:|:---------:|
| 64x Attention Modules | bf16\* |
| 3x Dense FFN Layers | nvfp4\*\* |
| 58x MoE FFN Layers | nvfp4 |
| 3x MTP Layers | bf16 |
| RouterGEMM\*\*\* | bf16 |

\* TensorRT LLM already supports [FP8 Attention](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#fp8-kv-cache-and-mla); for this latency scenario, low-precision attention computation doesn't help performance, so we use bf16 for the attention modules.

\*\* The nvfp4 model checkpoint is generated by the [NVIDIA Model Optimizer toolkit](https://github.com/NVIDIA/Model-Optimizer).

\*\*\* RouterGEMM uses bf16 inputs/weights with fp32 outputs for numerical stability.

### Parallelism Strategy

We have also explored and introduced a mixed parallelism strategy on 8x B200 GPUs. Specifically, the best strategy for this latency scenario is 'TP8EP2', which is defined as follows:

| Component | Parallelism Patterns |
|:---------------------:|:--------------------------------------------------------:|
| Attention Modules | Tensor Parallelism 8 (TP8) |
| MoE Sparse Experts | Mixed TP4 with Expert Parallelism 2 (EP2) |
| MoE Shared Experts | TP8 |
| Fuse_A GEMM | Data Parallelism 8 (DP8) |
| RouterGEMM | DP8 |

### Everything in One Diagram

Now let's put everything into one diagram, which represents an MoE layer from a decoding iteration.

tech_blog1_model_details

The modules in the diagram are:

- **Input Module**: A BF16 tensor with shape [m, 7168], where m is the number of tokens (for instance, m = 4 when using three MTP layers) and 7168 is the model's hidden size.
- **Module1: Fuse_A_GEMM**: Concatenates the weights for [WDQ, WDKV, and WKR](https://arxiv.org/pdf/2412.19437) to reduce kernel launch overhead.
- **Module2: 2× RMSNorm**: Performs normalization for the Q/K tensors. These can be either overlapped on multiple streams or fused into a single grouped RMSNorm.
- **Module3: UQ_QR_GEMM**: Concatenates the WUQ and WQR weights to reduce kernel launch overhead.
- **Module4: UK_BGEMM**: Uses WUK in a batched GEMM. We avoid absorbing Modules 3 and 4 to prevent weight-size inflation and extra loading costs.
- **Module5: Concat KVCache & applyRope**: Merges the K/V cache and applies RoPE (Rotary Position Embedding).
- **Module6: genAttention**: Performs MLA during generation, acting like an MQA with num_q_heads = 128 / TP8 = 16.
- **Module7: UV_GEMM**: Executes a batched GEMM with the WUV weights.
- **Module8: WO_GEMM**: Runs a dense GEMM using the WO weights. We do not absorb Modules 7 and 8 to avoid increased weight-loading overhead.
- **Module9: Fused Kernels**: Incorporates oneshotAllReduce, Add_RMSNorm, and DynamicQuant (BF16 -> NVFP4) in a single kernel.
- **Module10: routerGEMM & topK**: Handles the router GEMM and top-K selection.
- **Module11: Shared Expert**: Overlaps partially with Module10 and Module12.
- **Module12: Sparse Experts**: Implements the expert layers via grouped GEMM.
- **Module13: Final Fused Kernels**: Performs the localReduction, oneshotAllReduce, and Add_RMSNorm operations together.

## Key Optimizations

| Feature | TPS/User | Code Links / Notes |
|:----------------------------------------------------------|:--------:|:--------------------------------------------------------------------------------|
| Baseline: CUDA Graph + EP8TP8 | 67 | [modeling_deepseekv3.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_deepseekv3.py) |
| Multi Stream to overlap shared expert with sparse experts | 73 | [modeling_deepseekv3.py#L506](https://github.com/NVIDIA/TensorRT-LLM/blob/14bfb5e0d6e81aec3306a1324cf074566646f886/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L506) |
| Optimize MLA Kernel | 80 | [PR #3763](https://github.com/NVIDIA/TensorRT-LLM/pull/3763) |
| Optimize TopK Kernels | 84 | • [RoutingKernelTopK.cuh](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh)<br>• [noAuxTcKernels.cu](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/noAuxTcKernels.cu) |
| Optimize Fuse_A_GEMM | 89 | [attention.py#L345](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/modules/attention.py#L345) |
| MTP3_Vanilla | 154 | Evolved into MTP3_Autoregressive |
| Evolve to MTP3_Autoregressive + Optimize Router GEMM | 164 | [modeling_deepseekv3.py#L304](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L304) |
| Fuse oneshotAR + RMSNorm | 168 | [allReduceFusionKernels.cu#L440](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu#L440) |
| Enable PDL | 173 | Set environment variable: `export TRTLLM_ENABLE_PDL=1` |
| Multi-stream to overlap two RMS_norms | 180 | [attention.py#L546](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/modules/attention.py#L546) |
| MTP3_Autoregressive | 204 | [modeling_deepseekv3.py#L823](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L823) |
| Finetune clock/power | 211 | `sudo nvidia-smi -pm 0; sudo nvidia-smi -pm 1; sudo nvidia-smi boost-slider --vboost 4` |
| Optimize CUTLASS Grouped GEMM Kernels | 236 | The code is not yet open-source because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Optimize CUTLASS Flow: Sparse Experts as GEMMs | 249 | The code is not yet open-source because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Introduce EP4TP2 for better workload balance | 253 | Use `--tp 8 --ep 4` when benchmarking |
| Introduce moe_backend=TRTLLM, EP2TP4 for better balance | 299 | [PR #4280](https://github.com/NVIDIA/TensorRT-LLM/pull/4280) |
| Optimize Fuse_A_GEMM and Router_GEMM | 340 | WIP |
| Relax Acceptance | **368** | [deepseek_v3#multi-token-prediction-mtp](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#multi-token-prediction-mtp) |

### System Level optimizations

#### CUDA Graph & Programmatic Dependent Launch

[CUDA Graph](https://developer.nvidia.com/blog/cuda-graphs/) is necessary to overcome the CPU overhead for small workloads, while [Programmatic Dependent Launch](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Programmatic%2520Dependent%2520Launch#programmatic-dependent-launch-and-synchronization) can be used to further reduce kernel launch latency.

#### MTP

There are two optimizations based on MTP.

##### Autoregressive MTP Layers

| Version | Acceptance Rate | TPS/User | TPS/User Speedup |
|:-----------:|:---------------:|:--------:|:----------------:|
| Without MTP | 1.00 | 111 | 1.00 |
| MTP 1 | 1.92 | 198 | 1.78 |
| MTP 2 | 2.58 | 250 | 2.25 |
| MTP 3 | 2.82 | 253 | 2.28 |
| MTP 4 | 2.99 | 245 | 2.21 |
| MTP 5 | 3.01 | 239 | 2.15 |

Based on our exploration, the 3x MTP layer configuration demonstrates optimal performance.

##### Relax Acceptance Verification

For reasoning models (such as DeepSeek-R1), generation may consist of two phases: a thinking phase and the actual output.

During the thinking phase, when relaxed acceptance is enabled, a draft token can be accepted when it is in a candidate set. This candidate set is generated based on the top-N logits and a probability threshold:

- topN: The top-N tokens are sampled from the logits.
- Probability threshold: Among the top-N candidates, only tokens with a probability greater than the top-1 probability minus delta remain in the candidate set.

During the non-thinking phase, we still use strict acceptance.
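A minimal sketch of this candidate-set construction (illustrative only, not the actual implementation; `top_n` and `delta` correspond to the topN and delta parameters above):

```python
import torch

def relaxed_candidate_set(logits: torch.Tensor, top_n: int = 10, delta: float = 0.6):
    # logits: [vocab_size] for one draft position.
    probs = torch.softmax(logits.float(), dim=-1)
    top_probs, top_ids = probs.topk(top_n)
    keep = top_probs >= (top_probs[0] - delta)   # within delta of the top-1 probability
    return set(top_ids[keep].tolist())

# During the thinking phase, a draft token is accepted if it falls in this set;
# during the non-thinking phase, it must match the top-1 token exactly.
```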
| Version | Acceptance Rate | TPS/User Speedup |
|:------------------:|:--------------:|:----------------:|
| MTP3_top1, d0.0 | 2.82 | 1.00 |
| MTP3_top10, d0.5 | 3.06 | 1.08 |
| MTP3_top10, d0.6 | 3.10 | 1.09 |
| MTP3_top15, d0.5 | 3.07 | 1.08 |

This relaxed verification improves the acceptance rate and brings a positive speedup with limited influence on accuracy.

| Dataset | Test Size | Accuracy w/o Relaxed Acceptance | Accuracy w/ Relaxed Acceptance |
|:-------------------------:|:---------:|:----------:|:----------:|
| MMLU-Pro | 12,032 | 84.0% | 81.2% |
| Humanity's Last Exam | 2,684 | 9.0% | 9.0% |
| GPQA Diamond | 198 | 71.0% | 69.2% |
| MATH-500 | 500 | 96.0% | 96.2% |
| AIME 2024 | 30 | 68.0% | 74.0% |
| SciCode | 338 | 36.0% | 39.0% |
| LiveCodeBench | 315 | 62.0% | 66.0% |

For more information, please visit [multi-token-prediction-mtp](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#multi-token-prediction-mtp).

#### Multi-streams

We have introduced multi-stream-based optimizations to hide the overhead of some kernels, such as:

- Overlap shared experts with sparse experts
- Overlap the Concat_KVCache kernel with GEMM

#### Sparse Experts as GEMMs (only works when moe_backend=CUTLASS)

tech_blog1_sparse_exp_as_a_gemm

The existing CUTLASS-based Sparse Experts flow (illustrated in the figure) dispatches input tokens to their designated experts, then applies an indexed local reduction on each expert's outputs before a global allreduce. Both the dispatching and the indexed local reduction incur high overhead in low-latency scenarios. To address this, we propose treating "Sparse Experts as GEMMs" by sending all tokens to each activated expert and masking out unneeded outputs before the local reduction. Because grouped GEMMs are memory-bound, the extra computation from redundant tokens has minimal impact, effectively eliminating the costly dispatch and reduction overhead.
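The idea can be sketched as follows (illustrative PyTorch only; each expert is collapsed to a single weight matrix for brevity, whereas the real flow runs inside the grouped GEMM kernels):

```python
import torch

def sparse_experts_as_gemms(x, expert_weights, topk_ids, topk_scales):
    # x: [num_tokens, hidden]; expert_weights: list of [hidden, hidden] matrices
    # topk_ids / topk_scales: [num_tokens, top_k] routing decisions.
    out = torch.zeros_like(x)
    for e, w in enumerate(expert_weights):         # activated experts only
        y = x @ w                                  # one dense GEMM over ALL tokens
        # Routing mask: scale is 0 for tokens not routed to expert e, so their
        # (redundant) outputs are masked out during the local reduction.
        scale = (topk_scales * (topk_ids == e)).sum(dim=-1, keepdim=True)
        out += y * scale
    return out
```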
#### Re-balanced the sparse experts

For sparse experts, two parallelization strategies are commonly used: Expert Parallel (EP) and Tensor Parallel (TP). Expert Parallel maps each expert to a distinct GPU, achieving high memory and computational efficiency. However, token placement is data-dependent, which distributes workloads unevenly across GPUs and adds overhead to the AllReduce step after the MoE module. Tensor Parallel shards each expert evenly across GPUs, creating a balanced workload but sacrificing math/memory efficiency.

##### Mixed ETP

A combined EP/TP approach can mitigate both challenges. In practice, our experiments show that a TP4EP2 configuration offers the best performance.

##### Smart Router

Alternatively, by storing all expert weights on a cluster of four GPUs and replicating them to another four-GPU cluster, a smart router can dynamically dispatch tokens across each cluster. This design keeps the workload distribution balanced without significantly impacting local memory and computation efficiency.

### Kernel Level optimizations

#### Attention Kernel

We have developed a customized MLA attention kernel to better utilize GPU resources in latency scenarios.

#### Grouped GEMM

##### CUTLASS Backend (default backend)

Our default MoE backend is based on CUTLASS. It is flexible and robust, but not always the fastest option.

##### TRTLLM Backend

The other MoE backend is TRTLLM, which provides better performance. We are working to make it more flexible and robust, and it will eventually become the default backend for grouped GEMM computation in latency scenarios.

#### Communication Kernel

For small message sizes, regular NCCL latency-bound AllReduce kernels are inefficient, so we've developed a customized one-shot AllReduce kernel. It leverages the powerful NVSwitch hardware capability by acting like an initial broadcast followed by a local reduction, delivering better performance in min-latency scenarios.

#### Dense GEMM optimization

We focus on optimizing two dense GEMMs, Fuse_A_GEMM and RouterGEMM, because they dominate the execution time, suffer from low memory efficiency, and cannot be easily sharded (they are DP-based).

##### Fuse_A_GEMM

We developed a custom Fuse_A_GEMM that prefetches the majority of its weights into shared memory (enabled by PDL and overlapped with the one-shot AllReduce), significantly enhancing performance. The kernel shows substantial improvements over the default GEMM implementation when num_tokens < 16.

tech_blog1_fuse_a_gemm

##### RouterGEMM

By leveraging our internal AI code generator, we automatically generate an optimized RouterGEMM kernel, which delivers substantial improvements over the default GEMM implementation when num_tokens <= 30.

tech_blog1_router_gemm

#### Kernel fusion

Kernel fusion is necessary in min-latency scenarios to reduce extra global memory write/read cost. We currently support the following fusion patterns:

- Fuse two overlapped RMS_Norms into one GroupedRMSNorm
- Fuse (LocalReduction) + AR + RMS_Norm + (Dynamic_Quant_bf16tonvfp4) into one kernel
- Fuse Grouped GEMM_FC1 + dot activation (when moe_backend=TRTLLM) into one kernel

## How to reproduce

See https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-min-latency.

Note that Relaxed Acceptance is specific to the DeepSeek-R1 model. To enable it, set `add_generation_prompt = True` when preparing the benchmark dataset, for example:

```python
input_ids = tokenizer.encode(
    tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True),
    add_special_tokens=False,
)
```

You also need to set `use_relaxed_acceptance_for_thinking: true`, `relaxed_topk: 10`, and `relaxed_delta: 0.6` in `speculative_config`.

## Future Works

- More fusions
- More overlap
- Further optimization of the attention kernel
- More exploration of MTP

## Acknowledgment

Pushing the performance boundaries of DeepSeek R1 for latency-sensitive applications has been a remarkable engineering journey. The optimizations detailed in this post represent an exceptional cross-functional collaboration across the entire AI technology stack, spanning kernel-level optimizations, runtime enhancements, model quantization techniques, algorithmic improvements, and systematic performance analysis and tuning. While we can't individually acknowledge every contributor, we're proud to recognize the dedicated team of engineers whose collective expertise has helped advance the state of the art in TensorRT LLM performance engineering.
Through this collaborative endeavor, we've developed valuable insights into maximizing GPU utilization for large language model inference. We hope that the techniques and best practices shared in this blog will empower the developer community to better leverage NVIDIA GPU capabilities in their mission-critical LLM inference applications.

---

# DeepSeek R1 MTP Implementation and Optimization

by NVIDIA TensorRT LLM team

## Table of Contents

- [DeepSeek R1 MTP Implementation and Optimization](#deepseek-r1-mtp-implementation-and-optimization)
  - [Table of Contents](#table-of-contents)
  - [MTP for inference](#mtp-for-inference)
    - [Background](#background)
    - [MTP Vanilla](#mtp-vanilla)
    - [MTP Eagle](#mtp-eagle)
  - [MTP implementation in TensorRT LLM](#mtp-implementation-in-tensorrt-llm)
    - [Basic Implementation](#basic-implementation)
    - [MTP Modules](#mtp-modules)
    - [Attention for MTP](#attention-for-mtp)
    - [How to run DeepSeek models with MTP](#how-to-run-deepseek-models-with-mtp)
  - [MTP optimization - Relaxed Acceptance](#mtp-optimization---relaxed-acceptance)
    - [Relaxed Acceptance](#relaxed-acceptance)
    - [How to run the DeepSeek-R1 model with Relaxed Acceptance](#how-to-run-the-deepseek-r1-model-with-relaxed-acceptance)
  - [Evaluation](#evaluation)
    - [Achieving speedup with MTP speculative decoding](#achieving-speedup-with-mtp-speculative-decoding)
    - [Accuracy studies for Relaxed Acceptance](#accuracy-studies-for-relaxed-acceptance)
  - [Future Works](#future-works)
    - [Tree-based speculative decoding support](#tree-based-speculative-decoding-support)
    - [Eagle3 support](#eagle3-support)
    - [Fix known issues](#fix-known-issues)
  - [Acknowledgment](#acknowledgment)

TensorRT LLM achieves world-record inference performance for DeepSeek-R1 on NVIDIA Blackwell GPUs, where Multi-Token Prediction (MTP) delivers a significant speedup. In our [previous blog post](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md), we discussed the key optimizations that enable the outstanding inference latency of the DeepSeek-R1 model. This article dives deeper into the implementation and optimization of MTP in TensorRT LLM.

## MTP for inference

Inspired by previous [research work](https://arxiv.org/pdf/2404.19737), MTP is designed to help DeepSeek-V3 training. It adds additional MTP modules at the end of the main model and uses them to predict additional tokens. In this way, MTP extends the prediction scope to multiple future tokens at each position to achieve better model accuracy. During inference, those MTP modules can also be used for speculative decoding to further improve generation latency. In this section, we introduce the MTP speculative decoding algorithm for LLM inference.

### Background

Speculative decoding is a popular technique for faster and more cost-effective LLM inference. It is based on the premise that generating multiple future tokens per iteration (especially in the decode phase, which is less compute-bound) is more efficient than generating a single token at a time. Speculative decoding techniques usually divide the process into a low-cost draft stage and a parallelized verification stage. The draft stage predicts draft tokens using a small model or a subset of layers in the main model, and the verification stage uses the main model to determine how many of these draft tokens to accept, which is far more efficient than generating one token per iteration.
tech_blog2_verify_and_accept

Figure 1. Verification example

Figure 1 shows an example of how to verify and accept draft tokens. Assuming there are a total of 5 draft tokens "ABCDE", we append them to the input token "G" and feed a total of 6 tokens into the main model. After sampling, we get six expected tokens; we then compare the expected tokens with the draft tokens and accept the longest matched prefix. In this example, the tokens "ABC" are matched. Because "H" is predicted by the main model and the corresponding input token "C" is already accepted, "H" is also accepted. In this way, we can accept four tokens in a single iteration. MTP also uses this method to verify and accept draft tokens.
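The acceptance rule above can be sketched as follows (illustrative only; the actual verification in TensorRT LLM is batched and runs on device):

```python
def accept_draft_tokens(draft_tokens, expected_tokens):
    """Greedy verification: accept the longest matched prefix plus one bonus token.

    draft_tokens:    tokens proposed by the draft stage, e.g. ["A", "B", "C", "D", "E"]
    expected_tokens: tokens sampled from the main model for the same positions,
                     e.g. ["A", "B", "C", "H", ...] (one more than the drafts).
    """
    accepted = []
    for draft, expected in zip(draft_tokens, expected_tokens):
        accepted.append(expected)      # the main model's token at this position
        if draft != expected:          # first mismatch ends acceptance
            return accepted
    # All drafts matched: the final expected token is the extra "bonus" token.
    accepted.append(expected_tokens[len(draft_tokens)])
    return accepted

print(accept_draft_tokens(list("ABCDE"), list("ABCH??")))  # -> ['A', 'B', 'C', 'H']
```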
For the draft stage in MTP, there are two different methods, MTP Vanilla and MTP Eagle, which can be used for different inference cases.

### MTP Vanilla

tech_blog2_mtp_vanilla

Figure 2. MTP Vanilla, where $t_i$ is the input token, $d_i$ is the predicted draft token, $K$ is the number of MTP modules, and $h_i^n$ is the hidden state of the $n$-th MTP module. Note that $h_i^0$ means the hidden states of the main model. (Disclaimer: the figures are adapted from the original DeepSeek-V3 tech report.)

The MTP Vanilla method more closely follows MTP training: it sequentially uses different MTP modules to predict multiple draft tokens. This method supports model checkpoints with weights for multiple different MTP modules, and each MTP module has its own KV cache.

Figure 2 illustrates MTP Vanilla inference. In the context phase, assuming there are a total of four input tokens, we obtain the output token $t_5$ and the hidden states after the main model forward. The output token is appended to the input tokens, then we shift out the first token to get tokens $t_2$ to $t_5$ as the input tokens of the first MTP module. The hidden states from the main model are used directly as the input of the first MTP module to predict the first draft token. For the next few MTP modules, we append the newly generated draft token and the hidden state corresponding to the last input token to the input tokens and hidden states, then shift out the first token to prepare the inputs for the next MTP module. In this way, we retain as much information as possible from the main model, which helps the draft layers make more accurate predictions.

The generation phase differs slightly. The predicted token $t_5$ and the draft tokens are used as inputs for the main model. After the main model forward, we perform verification to get the accepted tokens. In this example, assume $j$ draft tokens $d_6$~$d_{j+5}$ are accepted. Then we prepare the MTP module inputs. Unlike the context phase, we prepare the input IDs and hidden states of a total of $K$ tokens before the last accepted token; in this example, the last accepted token is $t_{j+6}$. We then obtain the first draft token after the first MTP module forward. For the subsequent MTP modules, we prepare their inputs in a similar way to the context phase, so all of those MTP modules have the same input sequence length. After predicting all of the draft tokens, we need to evict the keys/values of the rejected draft tokens from the main model's KV cache to ensure the subsequent calculation is correct.

### MTP Eagle
tech_blog2_mtp_eagle

Figure 3. MTP Eagle, using the same notation as Figure 2

MTP Eagle can be viewed as a variant of the [Eagle](https://arxiv.org/pdf/2401.15077) speculative decoding method, though it currently only supports chain decoding. It reuses a single MTP module, invoking it repeatedly to predict draft tokens. MTP Eagle supports model checkpoints with only one MTP module; the official DeepSeek-V3 and DeepSeek-R1 checkpoints contain exactly one. Another difference from MTP Vanilla is the KV cache: in MTP Eagle, the MTP module reuses the same KV cache when predicting multiple draft tokens.

Figure 3 gives an MTP Eagle example. In the context phase, the inputs of the first MTP module forward are the same as in MTP Vanilla. However, for the subsequent MTP module forwards, the first difference is that MTP Eagle uses the same MTP module to predict draft tokens and reuses the same KV cache. Another difference is that we only need to input the token ID and the hidden state of one token: the token is the last predicted draft token, while the hidden state is the corresponding hidden state from the last MTP module forward. In this way, we can predict a total of $K$ draft tokens using only one MTP module.

In the generation phase, the verification stage is the same as in MTP Vanilla. Once we get the accepted tokens, we use all of them along with their corresponding hidden states as inputs for the first MTP module forward. Unlike MTP Vanilla, which needs to store past tokens and hidden states, this approach is much easier to implement. Subsequent MTP module forwards follow the same input preparation method as the context phase. After predicting all draft tokens, we need to evict the key/value pairs of any rejected draft tokens from the main model's KV cache.
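The chain-style drafting loop can be sketched as follows (illustrative Python pseudocode; `mtp_module`, `embed`, and `lm_head` are hypothetical stand-ins rather than the actual TensorRT LLM interfaces):

```python
import torch

def mtp_eagle_draft(mtp_module, embed, lm_head, accepted_tokens, accepted_hidden, k):
    """Predict k draft tokens by reusing one MTP module (chain decoding).

    accepted_tokens: token IDs accepted in the verification stage
    accepted_hidden: the main model's hidden states for those tokens
    """
    drafts = []
    # First forward: all accepted tokens and their hidden states.
    tokens, hidden = accepted_tokens, accepted_hidden
    for _ in range(k):
        out_hidden = mtp_module(embed(tokens), hidden)   # reuses one shared KV cache
        logits = lm_head(out_hidden[-1:])                # last position only
        draft = logits.argmax(dim=-1)
        drafts.append(draft)
        # Subsequent forwards: only the newly drafted token and its hidden state.
        tokens, hidden = draft, out_hidden[-1:]
    return torch.cat(drafts)
```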
## MTP implementation in TensorRT LLM

### Basic Implementation

TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for the DeepSeek-V3 and DeepSeek-R1 models.

tech_blog2_overall_workflow

Figure 4. MTP workflow in TensorRT LLM

Figure 4 shows the overall MTP workflow in TensorRT LLM. Both paths share the same runtime workflow; the differences lie in the MTP module forward passes. In the context phase, there are no draft tokens in the inputs. The TensorRT LLM model engine fetches the input IDs from the requests and feeds them into the model forward to get the next token and the hidden states. Then we prepare the MTP module inputs, and the MTP modules run forward to predict the draft tokens.

The generation workflow is more complicated because we need both the verification and draft stages. The predicted new token and draft tokens are the inputs for the main model. After the main model forward, we sample from the output logits to get the following new tokens, then compare them with the input draft tokens to get the final accepted tokens, which completes the verification stage. We then use the accepted tokens and hidden states to start a new draft stage, which uses the MTP layers to predict new draft tokens for the next iteration. Finally, we need to rewind the KV cache to evict the keys/values corresponding to the rejected tokens.

Except for the KV cache rewind, all of these steps run inside the model engine forward function. In this way, one model engine can support MTP inference, and it becomes easier to keep MTP compatible with other features, such as CUDA Graph and the overlap scheduler. When CUDA Graph is enabled, both the verification and draft stages can be captured in one graph, significantly reducing CPU overhead.

### MTP Modules
tech_blog2_mtp_modules

Figure 5. MTP model architecture

Figure 5 introduces the basic model architecture of [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326), [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047), and the basic [MTP module](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L829) design. Because MTP vanilla needs $K$ input tokens, if the number of accepted tokens is less than the number of input tokens, i.e. $j