.. Reusable note sections for docs.
   Include specific notes using:

   .. include:: /note_sections.rst
      :start-after: .. start-note-
      :end-before: .. end-note-

.. start-note-config-flag-alias

.. note::
   **Non-breaking**: ``--config`` is the preferred flag for passing a :ref:`YAML configuration file `. Existing workflows using ``--extra_llm_api_options`` continue to work; it is an equivalent alias.

.. end-note-config-flag-alias

.. start-note-traffic-patterns

.. note::
   **Traffic Patterns**: The ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration represent the **maximum supported values** for that config. Requests exceeding these limits may result in errors.

   To handle requests with input sequences **longer than the configured ISL**, add the following to your config file:

   .. code-block:: yaml

      enable_chunked_prefill: true

   This enables chunked prefill, which processes long input sequences in chunks rather than requiring them to fit within a single prefill operation. Note that enabling chunked prefill does **not** guarantee optimal performance; these configs are tuned for the specified ISL/OSL.

.. end-note-traffic-patterns

.. start-note-quick-start-isl-osl

.. note::
   The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the :ref:`Preconfigured Recipes` section below, which covers a larger set of traffic patterns and performance profiles.

.. end-note-quick-start-isl-osl

---

trtllm-bench
============

trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It provides three main subcommands for different benchmarking scenarios:

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

Syntax
------

.. click:: tensorrt_llm.commands.bench:main
   :prog: trtllm-bench
   :nested: full
   :commands: throughput, latency, build

Dataset preparation
-------------------

prepare_dataset.py
^^^^^^^^^^^^^^^^^^

trtllm-bench is designed to work with the `prepare_dataset.py `_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

**Dataset Types:**

- Real datasets from various sources
- Synthetic datasets with normal or uniform token distributions
- LoRA task-specific datasets

**Key Features:**

- Tokenizer integration for proper text preprocessing
- Configurable random seeds for reproducible results
- Support for LoRA adapters and task IDs
- Output in JSON format compatible with trtllm-bench

.. important::
   The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.

**Usage:**

prepare_dataset
"""""""""""""""

.. code-block:: bash

   python prepare_dataset.py [OPTIONS]

**Options**

----

..
list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--tokenizer`` - Tokenizer directory or HuggingFace model name (required) * - ``--output`` - Output JSON filename (default: preprocessed_dataset.json) * - ``--stdout`` - Print output to stdout with JSON dataset entry on each line (**required for trtllm-bench**) * - ``--random-seed`` - Random seed for token generation (default: 420) * - ``--task-id`` - LoRA task ID (default: -1) * - ``--rand-task-id`` - Random LoRA task range (two integers) * - ``--lora-dir`` - Directory containing LoRA adapters * - ``--log-level`` - Logging level: info or debug (default: info) dataset """"""" Process real datasets from various sources. .. code-block:: bash python prepare_dataset.py dataset [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--input`` - Input dataset file or directory (required) * - ``--max-input-length`` - Maximum input sequence length (default: 2048) * - ``--max-output-length`` - Maximum output sequence length (default: 512) * - ``--num-samples`` - Number of samples to process (default: all) * - ``--format`` - Input format: json, jsonl, csv, or txt (default: auto-detect) token_norm_dist """"""""""""""" Generate synthetic datasets with normal token distribution. .. code-block:: bash python prepare_dataset.py token_norm_dist [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--num-requests`` - Number of requests to be generated (required) * - ``--input-mean`` - Normal distribution mean for input tokens (required) * - ``--input-stdev`` - Normal distribution standard deviation for input tokens (required) * - ``--output-mean`` - Normal distribution mean for output tokens (required) * - ``--output-stdev`` - Normal distribution standard deviation for output tokens (required) token_unif_dist """"""""""""""" Generate synthetic datasets with uniform token distribution .. code-block:: bash python prepare_dataset.py token_unif_dist [OPTIONS] **Options** ---- .. list-table:: :widths: 20 80 :header-rows: 1 * - Option - Description * - ``--num-requests`` - Number of requests to be generated (required) * - ``--input-min`` - Uniform distribution minimum for input tokens (required) * - ``--input-max`` - Uniform distribution maximum for input tokens (required) * - ``--output-min`` - Uniform distribution minimum for output tokens (required) * - ``--output-max`` - Uniform distribution maximum for output tokens (required) --- trtllm-build =========================== .. argparse:: :module: tensorrt_llm.commands.build :func: parse_arguments :prog: trtllm-build --- trtllm-eval =========== About ----- The ``trtllm-eval`` command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the `accuracy test suite `_ of TensorRT LLM. ``trtllm-eval`` is built on the offline API -- LLM API. Compared to the online ``trtllm-serve``, the offline API provides clearer error messages and simplifies the debugging workflow. The following tasks are currently supported: .. list-table:: :header-rows: 1 :widths: 20 25 15 15 15 * - Dataset - Task - Metric - Default ISL - Default OSL * - CNN Dailymail - summarization - rouge - 924 - 100 * - MMLU - QA; multiple choice - accuracy - 4,094 - 2 * - GSM8K - QA; regex matching - accuracy - 4,096 - 256 * - GPQA - QA; multiple choice - accuracy - 32,768 - 4,096 * - JSON mode eval - structured generation - accuracy - 1,024 - 512 .. 
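As an end-to-end reference for the ``trtllm-bench`` dataset-preparation workflow described above, the following sketch generates a synthetic dataset and then benchmarks against it. The model name, token statistics, and file name are illustrative, and the ``trtllm-bench`` flags shown are assumptions; confirm them with ``trtllm-bench throughput --help``.

.. code-block:: bash

   # Generate 256 synthetic requests of ~1024 input / ~1024 output tokens.
   # --stdout is required so each dataset entry is emitted as one JSON line.
   python prepare_dataset.py \
       --tokenizer meta-llama/Llama-3.1-8B-Instruct \
       --stdout \
       token_norm_dist \
       --num-requests 256 \
       --input-mean 1024 --input-stdev 0 \
       --output-mean 1024 --output-stdev 0 \
       > synthetic_1k1k.jsonl

   # Run the throughput benchmark against the generated dataset
   # (flag names assumed; see `trtllm-bench throughput --help`).
   trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
       throughput --dataset synthetic_1k1k.jsonl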
.. note::
   ``trtllm-eval`` originates from the TensorRT LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, ``trtllm-eval`` is intended solely for development and is not recommended for production use.

Usage and Examples
------------------

Some evaluation tasks (e.g., GSM8K and GPQA) depend on the ``lm_eval`` package. To run these tasks, install ``lm_eval`` with:

.. code-block:: bash

   pip install -r requirements-dev.txt

Alternatively, you can install the ``lm_eval`` version specified in ``requirements-dev.txt``.

Here are some examples:

.. code-block:: bash

   # Evaluate Llama-3.1-8B-Instruct on MMLU
   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu

   # Evaluate Llama-3.1-8B-Instruct on GSM8K
   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k

   # Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
   trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond

The ``--model`` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, ``trtllm-eval`` runs the model with the PyTorch backend; you can pass ``--backend tensorrt`` to switch to the TensorRT backend. Alternatively, the ``--model`` argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the ``--tokenizer`` argument. For more details, see ``trtllm-eval --help``.

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

Syntax
------

.. click:: tensorrt_llm.commands.eval:main
   :prog: trtllm-eval
   :nested: full

---

trtllm-serve
============

.. toctree::
   :maxdepth: 1

   trtllm-serve
   run-benchmark-with-trtllm-serve

---

trtllm-serve
============

About
-----

The ``trtllm-serve`` command starts an OpenAI-compatible server that supports the following endpoints:

- ``/v1/models``
- ``/v1/completions``
- ``/v1/chat/completions``

For information about the inference endpoints, refer to the `OpenAI API Reference `__.

The server also supports the following endpoints:

- ``/health``
- ``/metrics``
- ``/version``

The ``metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.

Starting a Server
-----------------

The following abbreviated command syntax shows the commonly used arguments to start a server:

.. code-block:: bash

   trtllm-serve  [--tp_size  --pp_size  --ep_size  --host  --port ]

For the full syntax and argument descriptions, refer to :ref:`syntax`.
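For example, a minimal invocation that serves the TinyLlama model used throughout the following sections might look like this (a sketch; the host and port flags are optional and shown only for illustration):

.. code-block:: bash

   # Serve TinyLlama on a single GPU and expose it on port 8000.
   trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --host 0.0.0.0 --port 8000

   # From another shell, confirm the server is up via the health endpoint.
   curl http://localhost:8000/health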
Inference Endpoints
-------------------

After you start the server, you can send inference requests through the Completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 `_ for the examples in the following sections.

Chat API
~~~~~~~~

You can query the Chat API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_chat_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_chat_client.sh
   :language: bash
   :linenos:

Completions API
~~~~~~~~~~~~~~~

You can query the Completions API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_completion_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_completion_client.sh
   :language: bash
   :linenos:

Responses API
~~~~~~~~~~~~~

You can query the Responses API with any HTTP client; a typical example is the OpenAI Python client:

.. literalinclude:: ../../../../examples/serve/openai_responses_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

.. literalinclude:: ../../../../examples/serve/curl_responses_client.sh
   :language: bash
   :linenos:

More OpenAI-compatible examples can be found in the `compatibility examples `_ directory.

Multimodal Serving
~~~~~~~~~~~~~~~~~~

For multimodal models, you need to create a configuration file and start the server with additional options because of the following limitations:

* TRT-LLM multimodal is currently not compatible with ``kv_cache_reuse``
* Multimodal models require ``chat_template``, so only the Chat API is supported

To set up multimodal models:

First, create a configuration file: .. code-block:: bash cat >./config.yml<`__ for implementation details.

**Video**

* Using "video_url":

.. code-block:: json

   {"role": "user", "content": [
       {"type": "text", "text": "What's in this video?"},
       {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
   ]}

**Audio**

* Using "audio_url":

.. code-block:: json

   {"role": "user", "content": [
       {"type": "text", "text": "What's in this audio?"},
       {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
   ]}

Multi-node Serving with Slurm
-----------------------------

You can deploy the `DeepSeek-V3 `_ model across two nodes with Slurm and ``trtllm-serve``:

.. code-block:: bash

   echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > config.yml

   srun -N 2 -w [NODES] \
       --output=benchmark_2node.log \
       --ntasks 16 --ntasks-per-node=8 \
       --mpi=pmix --gres=gpu:8 \
       --container-image= \
       --container-mounts=/workspace:/workspace \
       --container-workdir /workspace \
       bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --config ./config.yml"

See `the source code `_ of ``trtllm-llmapi-launch`` for more details.

Metrics Endpoint
----------------

.. note::
   The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

   Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details. For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:

.. code-block:: yaml

   # extra_llm_config.yaml
   enable_iter_perf_stats: true

Start the server and specify the ``--config`` argument with the path to the YAML file:

.. code-block:: bash

   trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --config config.yaml

After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint. Because the statistics are stored in an internal queue and removed once retrieved, poll the endpoint shortly after each request and store the results if needed.

.. code-block:: bash

   curl -X GET http://localhost:8000/metrics
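For continuous monitoring, a small client can poll ``/metrics`` right after each request and append the returned records to a local file before they are dropped from the queue. The sketch below is illustrative only; the port, model name, and request body are assumptions carried over from the examples above.

.. code-block:: bash

   # Send one inference request ...
   curl -s http://localhost:8000/v1/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello, my name is", "max_tokens": 16}'

   # ... then immediately drain the metrics queue. Retrieved entries are removed
   # server-side, so accumulate them locally if you need the full history.
   curl -s -X GET http://localhost:8000/metrics >> metrics_log.jsonl

A single poll returns a JSON array of per-iteration statistics, as in the example output below.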
Example output:

.. code-block:: json

   [
     {
       "gpuMemUsage": 76665782272,
       "iter": 154,
       "iterLatencyMS": 7.00688362121582,
       "kvCacheStats": {
         "allocNewBlocks": 3126,
         "allocTotalBlocks": 3126,
         "cacheHitRate": 0.00128,
         "freeNumBlocks": 101253,
         "maxNumBlocks": 101256,
         "missedBlocks": 3121,
         "reusedBlocks": 4,
         "tokensPerBlock": 32,
         "usedNumBlocks": 3
       },
       "numActiveRequests": 1
       ...
     }
   ]

.. _configuring-with-yaml-files:

Configuring with YAML Files
---------------------------

You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. The arguments in the file override the corresponding command line arguments.

.. include:: ../../_includes/note_sections.rst
   :start-after: .. start-note-config-flag-alias
   :end-before: .. end-note-config-flag-alias

The YAML file configures `tensorrt_llm.llmapi.LlmArgs `_. Because this class has multiple levels of hierarchy, top-level arguments such as ``max_batch_size`` are set directly at the root of the file:

.. code-block:: yaml

   max_batch_size: 8

Nested arguments such as ``moe_config.backend`` are set under their parent key:

.. code-block:: yaml

   moe_config:
     backend: CUTLASS

Syntax
------

.. click:: tensorrt_llm.commands.serve:main
   :prog: trtllm-serve
   :nested: full

Besides the examples above, ``trtllm-serve`` is also used as an entry point for performance benchmarking. Refer to `Performance Benchmarking with trtllm-serve `_ for more details.

---

.. start-config-table-note

.. include:: ../_includes/note_sections.rst
   :start-after: .. start-note-traffic-patterns
   :end-before: .. end-note-traffic-patterns

.. end-config-table-note

.. start-deepseek-ai/DeepSeek-R1-0528

.. _deepseek-ai/DeepSeek-R1-0528:

`DeepSeek-R1 `_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

..
list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/8k1k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/1k1k_tp8_conc64.yaml`` * - 
8xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/database/deepseek-ai/DeepSeek-R1-0528/H200/8k1k_tp8_conc64.yaml`` .. end-deepseek-ai/DeepSeek-R1-0528 .. start-nvidia/DeepSeek-R1-0528-FP4-v2 .. _nvidia/DeepSeek-R1-0528-FP4-v2: `DeepSeek-R1 (NVFP4) `_ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - 4xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc16.yaml`` * - 4xB200_NVL - Balanced - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc32.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc64.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 128 - `1k1k_tp4_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc128.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 1024 - 256 - `1k1k_tp4_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp4_conc256.yaml`` * - 4xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config 
${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc16.yaml`` * - 4xB200_NVL - Balanced - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc32.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc64.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 128 - `8k1k_tp4_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc128.yaml`` * - 4xB200_NVL - Max Throughput - 8192 / 1024 - 256 - `8k1k_tp4_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp4_conc256.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 128 - `1k1k_tp8_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc128.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 256 - `1k1k_tp8_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/1k1k_tp8_conc256.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config 
${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc64.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 128 - `8k1k_tp8_conc128.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc128.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 256 - `8k1k_tp8_conc256.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-0528-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/database/nvidia/DeepSeek-R1-0528-FP4-v2/B200/8k1k_tp8_conc256.yaml`` .. end-nvidia/DeepSeek-R1-0528-FP4-v2 .. start-openai/gpt-oss-120b .. _openai/gpt-oss-120b: `gpt-oss-120b `_ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :width: 100% :header-rows: 1 :widths: 12 15 15 13 20 25 * - GPU - Performance Profile - ISL / OSL - Concurrency - Config - Command * - B200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp1_conc64.yaml`` * - B200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp1_conc64.yaml`` * - B200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc4.yaml`` * - B200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc8.yaml`` * - B200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc16.yaml`` * - B200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc32.yaml`` * - B200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp1_conc64.yaml`` * - 2xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp2_conc64.yaml`` * - 2xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp2_conc64.yaml`` * - 2xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc4.yaml`` * - 2xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc8.yaml`` * - 2xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc16.yaml`` * - 2xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc32.yaml`` * - 2xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp2_conc64.yaml`` * - 4xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp4_conc64.yaml`` * - 4xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp4_conc64.yaml`` * - 4xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc4.yaml`` * - 4xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc8.yaml`` * - 4xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc16.yaml`` * - 4xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc32.yaml`` * - 4xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp4_conc64.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k1k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 1024 / 8192 - 4 - `1k8k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 1024 / 8192 - 8 - `1k8k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 1024 / 8192 - 16 - `1k8k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 1024 / 8192 - 32 - `1k8k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/1k8k_tp8_conc64.yaml`` * - 8xB200_NVL - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc4.yaml`` * - 8xB200_NVL - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc8.yaml`` * - 8xB200_NVL - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc16.yaml`` * - 8xB200_NVL - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc32.yaml`` * - 8xB200_NVL - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/B200/8k1k_tp8_conc64.yaml`` * - H200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp1_conc64.yaml`` * - H200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp1_conc64.yaml`` * - H200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp1_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc4.yaml`` * - H200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp1_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc8.yaml`` * - H200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp1_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc16.yaml`` * - H200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp1_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc32.yaml`` * - H200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp1_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp1_conc64.yaml`` * - 2xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp2_conc64.yaml`` * - 2xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp2_conc64.yaml`` * - 2xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp2_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc4.yaml`` * - 2xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp2_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc8.yaml`` * - 2xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp2_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc16.yaml`` * - 2xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp2_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc32.yaml`` * - 2xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp2_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp2_conc64.yaml`` * - 4xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp4_conc64.yaml`` * - 4xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp4_conc64.yaml`` * - 4xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp4_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc4.yaml`` * - 4xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp4_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc8.yaml`` * - 4xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp4_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc16.yaml`` * - 4xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp4_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc32.yaml`` * - 4xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp4_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp4_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 1024 - 4 - `1k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 1024 - 8 - `1k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 1024 - 16 - `1k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 1024 - 32 - `1k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 1024 - 64 - `1k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k1k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 1024 / 8192 - 4 - `1k8k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 1024 / 8192 - 8 - `1k8k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 1024 / 8192 - 16 - `1k8k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config 
${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 1024 / 8192 - 32 - `1k8k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 1024 / 8192 - 64 - `1k8k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/1k8k_tp8_conc64.yaml`` * - 8xH200_SXM - Min Latency - 8192 / 1024 - 4 - `8k1k_tp8_conc4.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc4.yaml`` * - 8xH200_SXM - Low Latency - 8192 / 1024 - 8 - `8k1k_tp8_conc8.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc8.yaml`` * - 8xH200_SXM - Balanced - 8192 / 1024 - 16 - `8k1k_tp8_conc16.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc16.yaml`` * - 8xH200_SXM - High Throughput - 8192 / 1024 - 32 - `8k1k_tp8_conc32.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc32.yaml`` * - 8xH200_SXM - Max Throughput - 8192 / 1024 - 64 - `8k1k_tp8_conc64.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/database/openai/gpt-oss-120b/H200/8k1k_tp8_conc64.yaml`` .. end-openai/gpt-oss-120b --- Model Recipes ================ Quick Start for Popular Models ------------------------------- The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more. We maintain LLM API configuration files for these models containing recommended performance settings in two locations: * **Curated Examples**: `examples/configs/curated `_ - Hand-picked configurations for common scenarios. * **Comprehensive Database**: `examples/configs/database `_ - A more comprehensive set of known-good configurations for various GPUs and traffic patterns. The TensorRT LLM Docker container makes these config files available at ``/app/tensorrt_llm/examples/configs/curated`` and ``/app/tensorrt_llm/examples/configs/database`` respectively. You can reference them as needed: .. code-block:: bash export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment .. include:: ../_includes/note_sections.rst :start-after: .. start-note-quick-start-isl-osl :end-before: .. end-note-quick-start-isl-osl This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below. .. 
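As a concrete example, the gpt-oss-120b entry in the table below can be exercised end to end as shown here. The serve command matches the table; the request itself is only an illustrative smoke test and assumes the server's default port of 8000.

.. code-block:: bash

   export TRTLLM_DIR="/app/tensorrt_llm"

   # Launch the server with the curated max-throughput config from the table below.
   trtllm-serve openai/gpt-oss-120b \
       --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml

   # Once the server is up, send a quick chat request from another shell.
   curl -s http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'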
list-table:: :header-rows: 1 :widths: 20 15 15 20 30 * - Model Name - GPU - Inference Scenario - Config - Command * - `DeepSeek-R1 `_ - H100, H200 - Max Throughput - `deepseek-r1-throughput.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml`` * - `DeepSeek-R1 `_ - B200, GB200 - Max Throughput - `deepseek-r1-deepgemm.yaml `_ - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml`` * - `DeepSeek-R1 (NVFP4) `_ - B200, GB200 - Max Throughput - `deepseek-r1-throughput.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-FP4 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml`` * - `DeepSeek-R1 (NVFP4) `_ - B200, GB200 - Min Latency - `deepseek-r1-latency.yaml `_ - ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml`` * - `gpt-oss-120b `_ - Any - Max Throughput - `gpt-oss-120b-throughput.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml`` * - `gpt-oss-120b `_ - Any - Min Latency - `gpt-oss-120b-latency.yaml `_ - ``trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml`` * - `Qwen3-Next-80B-A3B-Thinking `_ - Any - Max Throughput - `qwen3-next.yaml `_ - ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --config ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml`` * - Qwen3 family (e.g. `Qwen3-30B-A3B `_) - Any - Max Throughput - `qwen3.yaml `_ - ``trtllm-serve Qwen/Qwen3-30B-A3B --config ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml`` (swap to another Qwen3 model name as needed) * - `Llama-3.3-70B (FP8) `_ - Any - Max Throughput - `llama-3.3-70b.yaml `_ - ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml`` * - `Llama 4 Scout (FP8) `_ - Any - Max Throughput - `llama-4-scout.yaml `_ - ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml`` Model-Specific Deployment Guides --------------------------------- The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM. .. toctree:: :maxdepth: 1 :name: Deployment Guides deployment-guide-for-deepseek-r1-on-trtllm.md deployment-guide-for-llama3.3-70b-on-trtllm.md deployment-guide-for-llama4-scout-on-trtllm.md deployment-guide-for-gpt-oss-on-trtllm.md deployment-guide-for-qwen3-on-trtllm.md deployment-guide-for-qwen3-next-on-trtllm.md deployment-guide-for-kimi-k2-thinking-on-trtllm.md Preconfigured Recipes --------------------- .. _recipe-selector: Recipe selector ^^^^^^^^^^^^^^^ .. trtllm_config_selector:: .. include:: ../_includes/note_sections.rst :start-after: .. start-note-traffic-patterns :end-before: .. end-note-traffic-patterns .. _recipe-database: Recipe database ^^^^^^^^^^^^^^^ The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings. .. include:: config_table.rst :start-after: .. end-config-table-note --- Dynamo K8s Example ================================= This example demonstrates how to deploy TensorRT-LLM on a Kubernetes cluster using Dynamo Cloud. 
Dynamo provides an operator-based approach to manage the lifecycle of model deployments through Custom Resource Definitions (CRDs). Please see `Dynamo Kubernetes Quick Start Guide `_ for more details. --- ======================================================= LLM Examples Introduction ======================================================= Here is a simple example to show how to use the LLM with TinyLlama. .. literalinclude:: ../../../examples/llm-api/quickstart_example.py :language: python :linenos: The LLM API can be used for both offline or online usage. See more examples of the LLM API here: .. toctree:: :maxdepth: 1 :caption: LLM API Examples %EXAMPLE_DOCS% For more details on how to fully utilize this API, check out: * `Common customizations `_ * `LLM API Reference <../llm-api/index.html>`_ --- .. TensorRT LLM documentation master file, created by sphinx-quickstart on Wed Sep 20 08:35:21 2023. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to TensorRT LLM's Documentation! ======================================== .. toctree:: :maxdepth: 2 :caption: Getting Started :name: Getting Started overview.md quick-start-guide.md installation/index.rst .. toctree:: :maxdepth: 2 :caption: Deployment Guide :name: Deployment Guide examples/llm_api_examples.rst examples/trtllm_serve_examples examples/dynamo_k8s_example.rst deployment-guide/index.rst .. toctree:: :maxdepth: 2 :caption: Models :name: Models models/supported-models.md models/adding-new-model.md .. toctree:: :maxdepth: 2 :caption: CLI Reference :name: CLI Reference commands/trtllm-bench commands/trtllm-eval commands/trtllm-serve/index .. toctree:: :maxdepth: 2 :caption: API Reference llm-api/index.md llm-api/reference.rst .. toctree:: :maxdepth: 2 :caption: Features features/feature-combination-matrix.md features/attention.md features/disagg-serving.md features/kvcache.md features/long-sequence.md features/lora.md features/multi-modality.md features/overlap-scheduler.md features/paged-attention-ifb-scheduler.md features/parallel-strategy.md features/quantization.md features/sampling.md features/additional-outputs.md features/guided-decoding.md features/speculative-decoding.md features/checkpoint-loading.md features/auto_deploy/auto-deploy.md features/ray-orchestrator.md features/torch_compile_and_piecewise_cuda_graph.md features/helix.md features/kv-cache-connector.md .. toctree:: :maxdepth: 2 :caption: Developer Guide developer-guide/overview.md developer-guide/perf-analysis.md developer-guide/perf-benchmarking.md developer-guide/ci-overview.md developer-guide/dev-containers.md developer-guide/api-change.md developer-guide/kv-transfer.md .. toctree:: :maxdepth: 2 :caption: Blogs :glob: blogs/tech_blog/* blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md blogs/H200launch.md blogs/XQA-kernel.md blogs/H100vsA100.md .. toctree:: :maxdepth: 2 :caption: Quick Links Releases Github Code Roadmap .. toctree:: :maxdepth: 2 :caption: Use TensorRT Engine :hidden: legacy/tensorrt_quickstart.md Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` --- .. _installation: Installation ============ There are multiple ways to install and run TensorRT LLM. For most users, the options below should be ordered from simple to complex. The approaches are equivalent in terms of the supported features. Note: **This project will download and install additional third-party open source software projects. 
Review the license terms of these open source projects before use.** 1. :ref:`containers` 2. Pre-built release wheels on `PyPI `_ (see :ref:`linux`) 3. :ref:`build-from-source-linux` .. toctree:: :maxdepth: 1 :caption: Links :hidden: containers linux build-from-source-linux --- Performance Tuning Guide ======================== .. include:: introduction.md :parser: myst_parser.sphinx_ .. toctree:: :maxdepth: 1 benchmarking-default-performance useful-build-time-flags tuning-max-batch-size-and-max-num-tokens deciding-model-sharding-strategy fp8-quantization useful-runtime-flags --- Functionals =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.functional :members: :undoc-members: :show-inheritance: --- Layers =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm Activation ------------ .. automodule:: tensorrt_llm.layers.activation :members: :undoc-members: :show-inheritance: Attention ------------ .. automodule:: tensorrt_llm.layers.attention :members: :undoc-members: :show-inheritance: Cast ------------ .. automodule:: tensorrt_llm.layers.cast :members: :undoc-members: :show-inheritance: Conv ------------ .. automodule:: tensorrt_llm.layers.conv :members: :undoc-members: :show-inheritance: Embedding ------------ .. automodule:: tensorrt_llm.layers.embedding :members: :undoc-members: :show-inheritance: Linear ------------ .. automodule:: tensorrt_llm.layers.linear :members: :undoc-members: :show-inheritance: MLP ------------ .. automodule:: tensorrt_llm.layers.mlp :members: :undoc-members: :show-inheritance: Normalization --------------- .. automodule:: tensorrt_llm.layers.normalization :members: :undoc-members: :show-inheritance: Pooling ------------ .. automodule:: tensorrt_llm.layers.pooling :members: :undoc-members: :show-inheritance: --- Models =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.models :members: :undoc-members: :show-inheritance: --- Plugin =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.plugin :members: :show-inheritance: --- Quantization =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.quantization :members: :show-inheritance: --- Runtime =========================== .. automodule:: tensorrt_llm .. currentmodule:: tensorrt_llm .. automodule:: tensorrt_llm.runtime :members: :undoc-members: :show-inheritance: --- # How to get best performance on DeepSeek-R1 in TensorRT LLM NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. [NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) In this blog, we share the configurations and procedures for reproducing these numbers on both B200 and H200 with the PyTorch workflow.
## Table of Contents

- [How to get best performance on DeepSeek-R1 in TensorRT LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
- [Table of Contents](#table-of-contents)
- [Prerequisites: Install TensorRT LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
- [1. Download TensorRT LLM](#1-download-tensorrt-llm)
- [2. Download the DeepSeek R1 models](#2-download-the-deepseek-r1-models)
- [3. Build and run TensorRT LLM container](#3-build-and-run-tensorrt-llm-container)
- [4. Compile and Install TensorRT LLM](#4-compile-and-install-tensorrt-llm)
- [5. Optional: Tune GPU clocks](#5-optional-tune-gpu-clocks)
- [6. Dataset preparation](#6-dataset-preparation)
- [Reproducing steps](#reproducing-steps)
- [B200 min-latency](#b200-min-latency)
- [Expected Results](#expected-results)
- [B200 max-throughput for R1-0528 with FP8 KV cache](#b200-max-throughput-for-r1-0528-with-fp8-kv-cache)
- [Benchmark](#benchmark)
- [Expected Result Format](#expected-result-format)
- [B200 max-throughput for R1 with FP16 KV cache](#b200-max-throughput-for-r1-with-fp16-kv-cache)
- [Benchmark](#benchmark-1)
- [Expected Result Format](#expected-result-format-1)
- [H200 min-latency](#h200-min-latency)
- [Expected Result Format](#expected-result-format-2)
- [H200 max-throughput](#h200-max-throughput)
- [Expected Result Format](#expected-result-format-3)
- [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
- [WIP: Enable more features by default](#wip-enable-more-features-by-default)
- [MLA chunked context](#mla-chunked-context)
- [Out of memory issues](#out-of-memory-issues)

## Prerequisites: Install TensorRT LLM and download models

This section can be skipped if you already have TensorRT LLM installed and have already downloaded the DeepSeek R1 model checkpoint.

#### 1. Download TensorRT LLM

**You can find more comprehensive instructions for installing TensorRT LLM in the [TensorRT LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html); refer to that guide if you run into any issues with the steps here.**

``` bash
# Prerequisites
apt-get update && apt-get -y install git git-lfs
git lfs install

# Replace with your actual path
YOUR_WORK_PATH=

# Clone the TensorRT LLM repository
cd $YOUR_WORK_PATH
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
```

**Note**: Replace `<*_PATH>` with your actual path.

#### 2. Download the DeepSeek R1 models

For NVIDIA Blackwell GPUs, it's recommended to use the [FP4 quantized version of DeepSeek R1](https://huggingface.co/nvidia/DeepSeek-R1-FP4) to get the best performance. For NVIDIA Hopper GPUs, it's recommended to use the FP8 version of the DeepSeek R1 model.

```bash
# Replace with your actual path
YOUR_MODEL_PATH=
cd $YOUR_MODEL_PATH

## Download NVFP4 model for Blackwell GPUs
git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
## Or the 0528 version
git clone https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2

## Download FP8 model for Hopper GPUs
## FP8 model also works for Blackwell, but FP4 has the best performance on Blackwell.
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1
```
#### 3. Build and run TensorRT LLM container

``` bash
cd TensorRT-LLM
make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH"
```

Here we set the `LOCAL_USER=1` argument to set up the local user instead of the root account inside the container; you can remove it if running as root inside the container is fine.

#### 4. Compile and Install TensorRT LLM

Here we compile the source inside the container:

``` bash
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "90-real;100-real" --python_bindings --clean
```

You can set `cuda_architectures` to "100-real" to target Blackwell only, or to "90-real" to target Hopper only, to save some build time.

Install and set environment variables:

```bash
pip install --user build/tensorrt_llm*.whl
export PATH=${HOME}/.local/bin:${PATH}
export PYTHONPATH=`pwd`
```

#### 5. Optional: Tune GPU clocks

```
sudo nvidia-smi -pm 0; sudo nvidia-smi -pm 1; sudo nvidia-smi boost-slider --vboost 4
```

The boost-slider option tunes the GPU clocks and can give a slight perf increase; for B200 min-latency scenarios it is worth about 8 TPS/user. This step is not required; it is provided here so that the perf numbers in this doc can be reproduced more closely to our internal runs.

#### 6. Dataset preparation

The trtllm-bench tool requires a dataset file from which it reads the prompt and output sequence length of each request. Format details of this dataset file can be seen in [preparing a dataset]( https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html#preparing-a-dataset). For min-latency benchmarking, **a real dataset is required**, since the MTP acceptance rate is affected by the dataset and therefore affects performance. You can use your own dataset following the format described in the link above. For max-throughput benchmarking, a synthetic dataset is representative enough, since MTP is not used; the command to generate a synthetic dataset is given in the max-throughput section.

## Reproducing steps

This section provides the steps to reproduce the numbers on NVIDIA Blackwell B200 and Hopper H200 GPUs, for both min-latency and max-throughput scenarios. All the benchmarking is done with the trtllm-bench command line tool provided in the TensorRT LLM installation; see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool. For brevity, we only provide the commands to reproduce the perf numbers without a detailed explanation of the tools and options in this doc. All commands here are assumed to run inside the container started by the `make -C docker run ...` command mentioned in the [Build and run TensorRT LLM container section](#3-build-and-run-tensorrt-llm-container).

### B200 min-latency

Our benchmark results are based on **Batch = 1, ISL = 1K, OSL = 2K, num_requests = 10 from a real dataset**. To do the benchmark, write the extra LLM API options to a YAML file (`config.yml`) and pass it to `trtllm-bench`:
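The exact tuned `config.yml` contents and `trtllm-bench` invocation depend on the scenario; the snippet below is only a minimal, hypothetical sketch of the pattern — write extra LLM API options to a YAML file, then point `trtllm-bench` at the model, the dataset, and that file. The YAML keys and values shown (`cuda_graph_config`, an MTP `speculative_config`, `kv_cache_config`) and the flag values are illustrative placeholders, not the tuned settings behind the published results.

```bash
# Illustrative sketch only -- not the tuned configuration from this blog.
YOUR_DATA_PATH=<path to your real dataset file>

cat > ./config.yml <<EOF
# Placeholder extra LLM API options; tune these for your own runs.
cuda_graph_config: {}
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
kv_cache_config:
  free_gpu_memory_fraction: 0.85
EOF

trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
  --dataset ${YOUR_DATA_PATH} \
  --backend pytorch \
  --extra_llm_api_options ./config.yml \
  --concurrency 1 \
  --num_requests 10
```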
---

Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. TensorRT LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ | BS: (in order) 256, 128

**Model Accuracy:** Quantization can often have adverse impacts on the accuracy of the model; however, TensorRT LLM's AWQ decreases the memory footprint of the model by **4x** while maintaining high accuracy.

Falcon-180B accuracy comparison

Preliminary measured accuracy, subject to change. TensorRT LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ

[**INT4 Activation-aware Weight Quantization (AWQ)**](https://arxiv.org/abs/2306.00978) (Lin et al., 2023) is a quantization technique which compresses the weights of an LLM down to 4 bits based on their relative importance and performs computation in FP16. This allows AWQ to retain higher accuracy than other 4-bit methods and reduce memory usage, but it requires special kernels capable of handling the change in precision performantly. TensorRT LLM has implemented custom kernels for AWQ, and has taken the technique a step further by performing FP8 computation on Hopper GPUs instead of the standard FP16.

Similar examples running Falcon-180B with quantization in TensorRT LLM are available in [examples/models/contrib/falcon](/examples/models/contrib/falcon).

## Llama-70B on H200 up to 6.7x A100

TensorRT LLM has improved its Grouped Query Attention (GQA) kernels in the generation phase, providing up to a 2.4x improvement on Llama-70B over TensorRT LLM v0.5 and achieving over **3,800** tok/s/GPU, up to **6.7x** faster than A100.

**H200 6.7x A100**

Llama-70B H200 vs A100 comparison

|Model |GPUs | Input Length | Output Length | Throughput (out tok/s/GPU)|
|:---------|:----|:-------------|:--------------|:------|
|Llama-70B | 1 | 128| 128 | 3,803 |
| | 8 | | | 3,803 |
| | 1 | | 2048 | 2,941 |
| | 8 | | | 3,163 |
| | 1 | | 4096 | 1,946 |
| | 8 | | | 2,263 |

Preliminary measured performance, subject to change. TensorRT LLM v0.7a | Llama2-70B | 1xH200 = TP1, 8xH200 = max TP/PP/DP config | FP8 | BS: (in order) 960, 960, 192, 560, 96, 640

**TensorRT LLM GQA now 2.4x faster on H200**

Llama-70B H200 December vs Oct.

Preliminary measured performance, subject to change. TensorRT LLM v0.7a vs TensorRT LLM v0.6a | Llama2-70B | 1xH200 TP1 | FP8 | BS 192

[**Grouped Query Attention (GQA)**](https://arxiv.org/abs/2305.13245v2) (Ainslie et al., 2023), used in Llama-70B, is a variant of Multihead Attention (MHA) which groups key-value (KV) heads together, resulting in fewer KV heads than query (Q) heads. TensorRT LLM has a custom implementation of MHA which supports GQA, multi-query attention (MQA) and standard MHA. It leverages Tensor Cores, including in the generation phase, and delivers great performance on NVIDIA GPUs.

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.7 & v0.8 releases. Similar examples running Llama-70B in TensorRT LLM are published in [examples/models/core/llama](/examples/models/core/llama). For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU: `out_tps=output_seqlen*batch_size/total_latency/tp`

**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel

---

> :bangbang: :new: *NVIDIA H200 has been announced & is optimized on TensorRT LLM. Learn more about H200, & H100 comparison, here:* [**H200** achieves nearly **12,000 tokens/sec on Llama2-13B** with TensorRT LLM](./H200launch.md)

# H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token

TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100ms.
For min-latency applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency.

max throughput 1st token latency

TensorRT LLM throughput & first token latency on H100 & A100. H100 FP8, A100 FP16, SXM 80GB GPUs, ISL/OSL's provided, TP=1, BS=32/64 max throughput, BS=1 1st token latency. TensorRT LLM v0.5.0, TensorRT 9.1. Max throughput calculated by sweeping BS 1,2,...,64. Throughput taken at the largest successful batch size.

**Max Throughput & Min Latency**

| Model | Batch Size | Input Length | Output Length | Throughput (out tok/s) | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :----------- | :------------ | ---------------------: | ---------------------: |
| **H100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **10,907** | 102 |
| GPT-J 6B | 1 | 128 | - | 185 | **7.1** |
| **A100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | 3,679 | 481 |
| GPT-J 6B | 1 | 128 | - | 111 | 12.5 |
| **Speedup** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **3.0x** | **4.7x** |
| GPT-J 6B | 1 | 128 | - | **2.4x** | 1.7x |

FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0, TensorRT 9.1

The full data behind these charts & tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)

Stay tuned for a highlight on Llama coming soon!

## MLPerf on H100 with FP8

In the most recent MLPerf results, NVIDIA demonstrated up to 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, the H100 showed a 2x increase over the A100. Switching to FP8 resulted in yet another 2x increase in speed.

## What is H100 FP8?

H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. Providing native support for FP8 data types, H100 can double performance and halve memory consumption compared to 16-bit floating point options on H100.

The FP8 specification, introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433), can be used to speed up training as well as inference with post-training quantization of models trained using 16-bit formats. The specification consists of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors. In practice, FP8 can improve perceived performance on H100 (FP8 vs FP16) by more than 2x.

FP8 is a W8A8 format, meaning both the weights and the activations (the compute) are stored in 8 bits. 8-bit weights decrease GPU memory consumption & bandwidth, meaning a larger model, sequence length, or batch size can fit on the same GPU. This can enable new use cases, and a larger max batch size can increase max throughput beyond 2x of FP16 H100.

---

:loudspeaker: Note: The data below uses TensorRT LLM v0.5. There have been significant improvements in v0.6 & later. Please see updated Llama performance [here](./Falcon180B-H200.md).
# H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform) achieves **11,819 tokens/s on Llama2-13B** on a single GPU. H200 is up to **1.9x faster** than H100. This performance is enabled by H200's larger, faster [HBM3e memory](#latest-hbm-memory). **H200 FP8 Max throughput** |Model | Batch Size(1) | TP(2) | Input Length | Output Length | Throughput (out tok/s/GPU) | |:----------|:-------------------------|:-----------------|:-------------|:--------------|---------------------------:| | llama_13b | 1024 | 1 | 128 | 128 | 11,819 | | llama_13b | 128 | 1 | 128 | 2048 | 4,750 | | llama_13b | 64 | 1 | 2048 | 128 | 1,349 | | llama_70b | 512 | 1 | 128 | 128 | 3,014 | | llama_70b | 512 | 2 | 128 | 2048 | 1,654 | | llama_70b | 64 | 1 | 2048 | 128 | 341 | | llama_70b | 32 | 1 | 2048 | 128 | 303 | Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8. *(1) Largest batch supported on given TP configuration by power of 2.* *(2) TP = Tensor Parallelism* Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html). ### H200 vs H100 H200's HBM3e larger capacity & faster memory enables up to **1.9x** performance on LLMs compared to H100. Max throughput improves due to its dependence on memory capacity and bandwidth, benefitting from the new HBM3e. First token latency is compute bound for most ISLs, meaning H200 retains similar time to first token as H100. For practical examples of H200's performance: **Max Throughput TP1:** an offline summarization scenario (ISL/OSL=2048/128) with Llama-70B on a single H200 is 1.9x more performant than H100. **Max Throughput TP8:** an online chat agent scenario (ISL/OSL=80/200) with GPT3-175B on a full HGX (TP8) H200 is 1.6x more performant than H100. H200 TPS Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4. | Llama-70B: H100 FP8 BS 8, H200 FP8 BS 32 | GPT3-175B: H100 FP8 BS 64, H200 FP8 BS 128 **Max Throughput across TP/BS:** Max throughput(3) on H200 vs H100 varies by model, sequence lengths, BS, and TP. Below results shown for maximum throughput per GPU across all these variables. max throughput llama sweep Preliminary measured performance, subject to change. TensorRT LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8. *(3) Max Throughput per GPU is defined as the highest tok/s per GPU, swept across TP configurations & BS powers of 2.* ### Latest HBM Memory H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity nearly 2X to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates performance of LLM model inference performance with faster throughput and tokens per second. These results are measured and preliminary, more updates expected as optimizations for H200 continue with TensorRT LLM. 
---

# New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

The XQA kernel provides optimizations for [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/abs/2305.13245v3) during the generation phase, and also provides an optimization for beam search. By using Tensor Cores for acceleration and reducing data loading and conversion, it delivers increased throughput within the same latency budget. The increased throughput allows a greater number of user requests to be served while providing the same experience. The support matrix and usage flags are described in [docs/source/advanced/gpt_attention](/docs/source/advanced/gpt-attention.md#xqa-optimization).

**Increased Throughput:** Looking at the Throughput-Latency curves below, we see that enabling the XQA optimization increases throughput. Higher throughput equates to serving more users, and the TPOT on the Y-axis flattens out when XQA is enabled.

XQA increased throughput within same latency budget

Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a

## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget

**H200 2.4x with XQA**

|Model |GPUs | Input Length | Output Length | Throughput w/o XQA (tok/s/GPU) | Throughput w/ XQA (tok/s/GPU) | Speedup |
|:---------|:----|:-------------|:--------------|:-------------------|:------------------|:--------|
|Llama-70B | 1 | 128 | 2048 | 1,227 | 2,941 | 2.4x |
| | 8 | 128 | 2048 | 13,232 | 25,300 | 1.9x |

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.8 release. For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU: `out_tps=output_seqlen*batch_size/total_latency/tp`

**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel

---

# Speed up inference with SOTA quantization techniques in TRT-LLM

The deployment and inference speed of LLMs are often impeded by limitations in memory capacity, memory bandwidth, and computation power. Quantization, which represents weights and activations with lower-precision data types like [FP8](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/), emerges as a vital strategy to address these bottlenecks. In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmarks, and offer best practices for selecting the appropriate quantization methods tailored to your specific use case.

## Quantization in TensorRT-LLM

TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speed up DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with ease of use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of code. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will expand to more model optimization techniques in the near future.

## Benchmark

### Performance

In the following benchmark, we highlight the acceleration of a few popular models at a small batch size without imposing latency constraints.
It's important to note that in scenarios where there's a latency constraint in your application, TRT-LLM can achieve an even greater performance improvement. Using LLaMA-v2-7B as an example, when the first token latency is constrained to be under 500ms, quantization with FP8 and a batch size of 16 achieves a notable **2.3x inference speedup** compared to FP16 on a H100. | Model | Batch Size | Speedup (FP8 v.s. FP16) | Speedup (INT8 SQ v.s. FP16) | | ----------- | :--------: | :---------------------: | :-------------------------: | | GPT-J | 1 | 1.40x | 1.40x | | GPT-J | 8 | 1.44x | 1.30x | | LLaMA-v2-7B | 1 | 1.51x | 1.47x | | LLaMA-v2-7B | 8 | 1.40x | 1.32x | *The above benchmarks were run with Input Length=1024, Output Length=128, and TP=1 on H100 80GB. ### Accuracy | Model | Quantization Methods | MMLU Baseline (FP16) | MMLU Post-quantization | MMLU Loss | | ------------ | :------------------: | :------------------: | :--------------------: | :-------: | | Falcon-180B | FP8 | 70.4 | 70.3 | 0.14% | | | INT8-SQ | 70.4 | 68.6 | 2.56% | | | INT4-AWQ | 70.4 | 69.8 | 0.85% | | Falcon-40B | FP8 | 56.1 | 55.6 | 0.89% | | | INT8-SQ | 56.1 | 54.7 | 2.50% | | | INT4-AWQ | 56.1 | 55.5 | 1.07% | | LLaMA-v2-70B | FP8 | 69.1 | 68.5 | 0.87% | | | INT8-SQ | 69.1 | 67.2 | 2.75% | | | INT4-AWQ | 69.1 | 68.4 | 1.01% | | MPT-30B | FP8 | 47.5 | 47.4 | 0.21% | | | INT8-SQ | 47.5 | 46.8 | 1.47% | | | INT4-AWQ | 47.5 | 46.5 | 2.11% | ## Best practices to choose the right quantization methods A quantization method comprises three primary components: 1. Weight precision format 2. Activation precision format 3. Calibration algorithms Typically, in the context of small-batch inference scenarios (batch size ≤ 4), the key consideration is memory bandwidth, making weight-only quantization methods the preferred choice. Conversely, for large-batch inference scenarios, such as serving scenarios (batch size ≥ 16), both memory bandwidth and computation density become crucial factors. Consequently, it's recommended to opt for a quantization method that has both weight and activation quantized. For batch size ≥ 16, the choice of quantization method can be model specific. We suggest to prioritize using FP8 first, as we typically see it offers the best performance and accuracy. If the results do not meet your specific use case, you can further experiment with Int8 SmoothQuant (Int8 SQ) followed by AWQ and/or GPTQ. Based on specific use cases, users might have different tolerances on accuracy impact and calibration time. The table below summarizes the tradeoffs* to consider when choosing a quantization method. You can also learn more about precision formats in our [documentation](https://nvidia.github.io/TensorRT-LLM/reference/precision.html). | Quantization Methods | Performance Improvement (batch size <= 4) | Performance Improvement (batch size >= 16) | Accuracy Impact | Calibration Time** | | :----------------------- | :---------------------------------------: | :----------------------------------------: | :-------------: | :----------------: | | FP8 (W8A8) | Medium | Medium | Very Low | Minutes | | Int8 SQ (W8A8) | Medium | Medium | Medium | Minutes | | Int8 weight-only (W8A16) | Medium | Low | Low | Not Required | | Int4 weight-only (W4A16) | High | Low | High | Not Required | | Int4 AWQ (W4A16) | High | Low | Low | Tens of Minutes | | Int4 GPTQ | High | Low | Low | Tens of Minutes | | Int4-FP8 AWQ (W4A8) | High | Medium | Low | Tens of Minutes | \* The performance and impact are measured on 10+ popular LLMs. 
We'll follow up with more data points. ** Calibration time is subject to the actual model size. We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit. ## What’s coming next TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases. --- # ADP Balance Strategy By NVIDIA TensorRT LLM team ## Table of Contents - [ADP Balance Strategy](#adp-balance-strategy) - [Table of Contents](#table-of-contents) - [Motivation and Background](#motivation-and-background) - [Theoretical Analysis and Modeling](#theoretical-analysis-and-modeling) - [Mathematical Modeling](#mathematical-modeling) - [Scheduling Strategies for Load Balancing](#scheduling-strategies-for-load-balancing) - [Baseline: Round-Robin Token Distribution](#baseline-round-robin-token-distribution) - [ADP Balance Strategy: Coordinated Waiting Mechanism](#adp-balance-strategy-coordinated-waiting-mechanism) - [Performance Analysis: Baseline vs. ADP Balance](#performance-analysis-baseline-vs-adp-balance) - [Experiments](#experiments) - [Setting](#setting) - [Dataset Configuration](#dataset-configuration) - [Hardware and Model Configuration](#hardware-and-model-configuration) - [Performance Results](#performance-results) - [Performance Summary](#performance-summary) - [Baseline Performance](#baseline-performance) - [ADP Balance with Context Wait Implementation](#adp-balance-with-context-wait-implementation) - [ADP Balance with Full Strategy Implementation](#adp-balance-with-full-strategy-implementation) - [Pareto Analysis: Throughput-Latency Trade-off Optimization](#pareto-analysis-throughput-latency-trade-off-optimization) - [Conclusion](#conclusion) - [Acknowledgement](#acknowledgement) ## Motivation and Background In DeepSeek MLA + MoE architectures under maximum-throughput scenarios, an Attention Data Parallel (ADP) + MoE Expert Parallel (EP) strategy is commonly employed to eliminate redundant KV cache storage, and utilize disaggregated serving to prevent ADP imbalances. However, certain deployment scenarios still favor In-Flight Batching (IFB) inference, including: - **System complexity reduction**: Avoiding the operational overhead and maintenance costs associated with disaggregated architectures - **Specific workload patterns**: Scenarios with short input sequence lengths (ISL) and long output sequence lengths (OSL) - **Offline inference**: Batch processing environments where Time-To-First-Token (TTFT) and Time-To-Output-Token (TPOT) requirements are more relaxed However, IFB introduces significant load imbalance challenges within the Attention module that severely impact system performance. The core issue arises when different ranks simultaneously handle heterogeneous workloads within the same iteration. 
For instance, some ranks may be processing computationally intensive context phases while others execute generation phases, creating substantial disparities in token processing loads. This bottlenecks the overall system throughput, as the iteration time is dominated by the slowest rank. To address this critical performance limitation, we introduce the **ADP (Attention Data Parallel) Balance Strategy**—a novel scheduling optimization designed to achieve optimal load distribution across DP ranks and maximize system utilization. ## Theoretical Analysis and Modeling **Optimization Objective**: Minimize load imbalance across different GPU ranks to maximize overall system throughput. ### Mathematical Modeling We model and quantify the performance impact of load imbalance in Attention DP. Since workloads across ranks can be heterogeneous, the execution time for the Attention module in any given iteration is bounded by the rank with the highest workload: $$ time_i = \max_{0 \leq m < N} time_{i,m} $$ where $time_{i,m}$ represents the execution time of rank $m$ in iteration $i$, and $N$ is the data parallel size. To quantify load balance and theoretical performance bounds, we define two key metrics: #### 1. Balance Ratio The balance ratio measures the load distribution across ranks within the Attention module for each iteration: $$ balance = \frac{tokens_{avg}}{tokens_{max}} $$ where: - $tokens_{avg}$ represents the average number of tokens across all ranks - $tokens_{max}$ represents the maximum number of tokens across all ranks - $tokens_i$ represents the number of tokens processed by rank $i$ Note: MoE module load balancing is handled separately by the Expert Parallel Load Balancer (EPLB) module and is not considered during the early scheduling phase. #### 2. Speed-of-Light Throughput (SOL TPS) The Speed-of-Light throughput represents the theoretical upper-bound throughput achievable with perfect load balancing: $$ time_{sol} = \sum_{i=0}^{\infty} time_i \times balance $$ $$ tps_{sol} = \frac{time_{elapsed}}{time_{sol}} \times tps_{actual} $$ where: - $time_i$: Measured execution time of iteration $i$ - $time_{elapsed}$: Total empirically measured end-to-end execution time - $tps_{actual}$: Observed throughput in tokens per second - $tps_{sol}$: Theoretical maximum throughput under perfect load balance This theoretical framework enables us to quantify the performance gap between current and optimal system utilization, providing clear targets for optimization. ### Scheduling Strategies for Load Balancing The fundamental challenge in Attention DP is that ranks can process vastly different token loads within the same iteration, causing the overall execution time to be bottlenecked by the most heavily loaded rank. #### Baseline: Round-Robin Token Distribution The conventional approach employs a global load balancing strategy that sorts incoming requests by `num_tokens` and distributes them across ranks using round-robin scheduling, as illustrated in Figure 1. This method achieves reasonable token distribution from a cumulative perspective and effectively reduces token count disparities when all ranks are simultaneously processing context requests.

Figure 1: Baseline round-robin strategy balances context request tokens across ranks through sorting and cyclic distribution
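To make the baseline concrete, here is a small, hypothetical Python sketch of the sort-and-round-robin assignment described above, together with the per-iteration balance ratio defined earlier. The function and variable names are illustrative only and are not TensorRT LLM APIs.

```python
from typing import Dict, List

def round_robin_schedule(request_tokens: List[int], num_ranks: int) -> Dict[int, List[int]]:
    """Sort requests by token count, then deal them out cyclically across ranks."""
    assignment: Dict[int, List[int]] = {rank: [] for rank in range(num_ranks)}
    for i, tokens in enumerate(sorted(request_tokens, reverse=True)):
        assignment[i % num_ranks].append(tokens)
    return assignment

def balance_ratio(tokens_per_rank: List[int]) -> float:
    """balance = tokens_avg / tokens_max for one iteration."""
    tokens_avg = sum(tokens_per_rank) / len(tokens_per_rank)
    tokens_max = max(tokens_per_rank)
    return tokens_avg / tokens_max

# Example: 8 context requests distributed over 4 ranks.
assignment = round_robin_schedule([900, 850, 700, 640, 300, 220, 180, 90], num_ranks=4)
per_rank_tokens = [sum(reqs) for reqs in assignment.values()]
print(per_rank_tokens, f"balance={balance_ratio(per_rank_tokens):.2f}")
```

As the limitations discussed next point out, this only balances the cumulative token count; it says nothing about which iteration each request's context phase lands in.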

**Limitations**: While effective globally, this approach fails to guarantee per-iteration load balance. A critical scenario arises when some ranks process context phases, while others handle generation (decode), creating severe load imbalances that dominate overall execution time. #### ADP Balance Strategy: Coordinated Waiting Mechanism To address the per-iteration load imbalance problem, we propose the **ADP Balance Strategy**, which employs a sophisticated waiting mechanism to synchronize context processing across ranks. The core principle is strategic delay: instead of immediately scheduling context requests to available ranks, the system waits strategically to ensure multiple ranks have similar workloads before proceeding. **Algorithm Design**: The strategy introduces two complementary control parameters: **1. Context Synchronization (`timeout_iters`)** - **Purpose**: Ensures temporal alignment of context processing across ranks - **Mechanism**: When a rank becomes available for context processing while others remain in generation phases, it waits up to `timeout_iters` iterations until all other ranks have context tasks - **Benefit**: Prevents the scenario where one rank processes context tasks while others handle generation tasks **2. Batch Equilibration (`batching_wait_iters`)** - **Purpose**: Balances the number of accumulated context batches across ranks - **Mechanism**: After initial synchronization, ranks with fewer accumulated context batches wait up to `batching_wait_iters` additional iterations to accumulate more batches - **Benefit**: Prevents load imbalances caused by uneven context batch accumulation, where some ranks may have multiple batches while others have only one ### Performance Analysis: Baseline vs. ADP Balance To illustrate the effectiveness of our approach, consider a simplified scenario where: - All ranks have equal-length contexts and M ongoing requests - N new requests arrive sequentially over N iterations. - Context processing time: `time(ctx)` >> Generation processing time: `time(gen)` **Baseline Behavior:** In the traditional approach, contexts are processed sequentially across ranks, resulting in severe load imbalances: ```text iter_i: [*C0*, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] iter_i+1: [g00, g01, ..., g0M], [*C1*, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] ... iter_i+N-1: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [*CN*, gN1, ..., gNM] ``` *Legend: `*Ci*` = context request i, `gij` = generation request j on rank i* - **Per-iteration time**: `time(ctx)` (dominated by context processing) - **Total execution time**: `time(ctx) × N` - **Balance ratio**: `(ctx_len + (M-1) + M × (N-1)) / (N × ctx_len)` (poor balance) **ADP Balance Strategy:** Our method synchronizes context processing by strategic waiting: ```text iter_i: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] iter_i+1: [g00, g01, ..., g0M], [g10, g11, ..., g1M], ..., [gN0, gN1, ..., gNM] ... 
iter_i+N-1: [*C0*, g01, ..., g0M], [*C1*, g11, ..., g1M], ..., [*CN*, gN1, ..., gNM] ``` - **Per-iteration time**: `time(gen)` for first N-1 iterations, `time(ctx)` for final iteration - **Total execution time**: `time(gen) × (N-1) + time(ctx)` - **Balance ratio**: 1.0 (perfect balance) - **Time savings**: `(time(ctx) - time(gen)) × (N-1)` **Trade-offs:** - ✅ **Throughput improvement** due to optimal load balancing - ✅ **Maximized GPU utilization** across all ranks - ⚠️ **Increased TTFT** due to strategic waiting mechanism - 📋 **Best suited for** throughput-oriented scenarios where TTFT is not critical ## Experiments ### Setting #### Dataset Configuration We evaluate our approach using a comprehensive dataset comprising 16,000 inference requests with the following characteristics: - **Request volume**: 16,000 total requests - **Average input length**: 803 tokens - **Average output length**: 3,653 tokens - **Token distribution**: Figure 2 illustrates the distribution patterns for both input and output sequences

Figure 2: Distribution of input and output token lengths

**Dataset Characteristics**: The dataset exhibits significant diversity in sequence lengths, with output tokens following a pronounced long-tail distribution. This heterogeneity presents substantial challenges for load balancing, as it becomes difficult to co-schedule multiple context requests within the same iteration while minimizing computational bubbles—making it an ideal testbed for evaluating our scheduling strategy. #### Hardware and Model Configuration **Infrastructure**: - **Platform**: NVIDIA Blackwell GB200 system - **GPU Count**: 8 × GB200 GPUs - **Model**: DeepSeek V3 - **Parallelization Strategy**: - Attention module: Data Parallel (DP) size = 8 - MoE module: Expert Parallel (EP) size = 8 ### Performance Results We evaluate three distinct configurations to demonstrate the progressive benefits of our ADP Balance strategy: 1. **Baseline**: Round-robin scheduling 2. **ADP Balance (Context Wait)**: Implementing `timeout_iters` parameter only 3. **ADP Balance (Full Strategy)**: Complete implementation with both `timeout_iters` and `batching_wait_iters` #### Performance Summary | Configuration | Actual TPS | Avg Balance Ratio | SOL TPS | Speedup | |---------------|------------|-------------------|-------------------|---------| | Baseline | 25,664 | 54.11% | 39,552 | 1.00× | | ADP Balance (Context Wait) | 33,499 | 84.33% | 38,312 | 1.31× | | ADP Balance (Full Strategy) | 34,140 | 87.70% | 37,912 | 1.33× | **Key Observations**: - Context Wait alone delivers a substantial **31% throughput improvement** - Full strategy achieves **33% total speedup** with near-optimal balance (87.70%) - Balance ratio improvement: **54% → 87%** represents a dramatic reduction in load imbalance *Note: The decrease in SOL TPS with waiting mechanisms occurs because the strategic delays in context scheduling increase the total number of iterations required to complete all requests. Since SOL TPS calculation only accounts for load imbalance effects within each iteration, it doesn't reflect the iteration count increase caused by delayed context entry, leading to an apparent reduction despite overall system efficiency improvements.* #### Baseline Performance Figure 3 provides comprehensive insight into baseline system behavior, displaying both average tokens across ranks (top) and the corresponding balance ratio (bottom) by iteration. The balance ratio serves as a key indicator: values approaching 1.0 represent optimal balance, while values near 0.0 indicate severe imbalances.

Figure 3: Baseline performance overview showing token distribution and balance ratios across all iterations

**Critical Insights**: - **Imbalance window**: Most severe imbalances occur within the first 12,000 iterations, as evidenced by the average token distribution showing that all context processing phases occur within this critical interval - **Performance gap**: SOL TPS of 39,552 vs. actual TPS of 25,664 reveals a **54% relative performance gap** - **System behavior**: After iteration 12,000, all requests transition to generation phase, naturally reducing imbalances Figure 4 zooms into the critical imbalance period [100-12,000], revealing the dramatic instability in load distribution:

Figure 4: Detailed baseline analysis for iterations 100-12,000 showing severe balance fluctuations

**Performance Bottlenecks**: - Balance ratio frequently drops to **0.4 or lower**, indicating 60%+ load imbalance - Theoretical improvement potential of **70.23%** within the critical window - Extreme volatility in load distribution creates unpredictable performance characteristics #### ADP Balance with Context Wait Implementation The Context Wait mechanism (`timeout_iters=50`) demonstrates the effectiveness of our first optimization component, achieving substantial performance improvements through context synchronization. **Performance Achievements**: - **Throughput**: 33,499 TPS (1.31× speedup) - **Balance improvement**: 84.33% average (vs. 54.11% baseline) - **Efficiency**: Actual TPS significantly closer to theoretical SOL TPS (38,312)

Figure 5: Context Wait performance showing improved balance stability for iterations 100-12,000

**Remaining Challenges**: Despite significant improvements, residual imbalances persist due to: 1. **Timeout scenarios**: Some ranks exceed the waiting threshold when context requests don't arrive uniformly 2. **Batch accumulation disparity**: Longer-waiting ranks accumulate multiple context batches while recently-freed ranks process single batches 3. **Partial synchronization**: While initial synchronization succeeds, subsequent load variations still occur This analysis motivated the development of our second optimization component: batch equilibration. #### ADP Balance with Full Strategy Implementation The complete ADP Balance strategy combines both context synchronization and batch equilibration mechanisms, delivering optimal load balancing performance. **Configuration**: `timeout_iters=50` + `batching_wait_iters=10` **Performance Optimization Results**: - **Peak throughput**: 34,140 TPS (1.33× speedup) - **Optimal balance**: 87.70% average balance ratio - **Near-theoretical efficiency**: Actual TPS (34,140) approaches SOL TPS (37,912) - **System stability**: Dramatically reduced load variance across iterations The effectiveness of our complete ADP Balance implementation is clearly demonstrated in Figure 6. The visualization reveals how the combination of context synchronization and batch equilibration mechanisms achieves near-optimal load balancing throughout the critical execution window.

Figure 6: Full ADP Balance strategy demonstrating superior balance stability for iterations 100-12,000

**Key Improvements Over Context Wait**: - **Enhanced stability**: Balance ratio maintains consistently higher values with reduced volatility - **Residual imbalance mitigation**: Batch equilibration addresses the remaining load disparities - **System predictability**: More uniform performance characteristics across iterations **Implementation Trade-offs**: - ✅ **Maximum throughput improvement**: 33% gain over baseline - ✅ **Near-optimal load balancing**: 87.70% average balance ratio - ⚠️ **Iteration overhead**: Waiting mechanisms increase total iteration count - ⚠️ **TTFT impact**: Strategic delays affect time-to-first-token metrics **Production Configuration**: Users can enable the full ADP Balance strategy by adding the following configuration: ```yaml attention_dp_config: enable_balance: true batching_wait_iters: 10 timeout_iters: 50 ``` ### Pareto Analysis: Throughput-Latency Trade-off Optimization Understanding the performance trade-offs inherent in our ADP Balance strategy is crucial for production deployment decisions. Figure 7 presents a comprehensive Pareto frontier analysis that maps the relationship between system throughput (TPS per GPU) and Time-To-First-Token (TTFT) across varying workload intensities and parameter configurations. **Experimental Design**: The analysis evaluates multiple configurations of `timeout_iters` (TO) and `batching_wait_iters` (BW) parameters under different system load conditions, revealing how parameter tuning affects the fundamental throughput-latency trade-off.

Figure 7: Pareto frontier analysis showing throughput-latency trade-offs across different ADP Balance configurations

**Key Findings**: 1. **Universal Throughput Gains**: ADP Balance consistently delivers superior TPS/GPU performance across the entire operational spectrum, from latency-sensitive to throughput-maximized deployments 2. **Scalability Benefits**: Performance improvements become increasingly pronounced under higher system loads, where load imbalance penalties are most severe 3. **TTFT Trade-off**: Throughput gains necessitate increased first-token latency due to the strategic waiting mechanisms, with higher parameter values yielding greater throughput at the cost of longer response initiation 4. **Configuration Guidance**: - **Low-load scenarios**: `batching_wait_iters` provides minimal benefit while adding latency overhead - **High-throughput scenarios**: Both parameters contribute significantly to performance optimization - **Balanced deployments**: Moderate parameter values offer optimal throughput-latency balance **Production Implications**: This analysis empowers system operators to make data-driven configuration decisions based on specific deployment requirements—whether optimizing for minimum response latency or maximum system throughput. ## Conclusion Load imbalance in Attention Data Parallel processing represents a fundamental bottleneck in large language model inference systems, particularly under In-Flight Batching scenarios where heterogeneous workloads create severe performance penalties. This work introduces the **ADP Balance Strategy**—a sophisticated scheduling optimization that addresses this critical challenge through coordinated waiting mechanisms. **Technical Contributions**: Our approach employs two complementary optimization components: context synchronization (`timeout_iters`) and batch equilibration (`batching_wait_iters`). These mechanisms work in concert to achieve temporal alignment of computationally intensive context processing across data parallel ranks, effectively eliminating the performance bottlenecks caused by rank-level load imbalances. **Experimental Validation**: Comprehensive evaluation on the DeepSeek V3 architecture demonstrates compelling performance improvements: - **33% throughput increase**: From 25,664 to 34,140 TPS - **87% load balance achievement**: Dramatic improvement from 54% baseline - **Near-theoretical efficiency**: Actual performance approaching speed-of-light throughput bounds **Production Readiness**: The Pareto frontier analysis provides critical insights for real-world deployment, revealing that while the strategy introduces TTFT trade-offs, it consistently delivers superior throughput across diverse operational scenarios. The configurable parameter framework enables operators to optimize for their specific performance requirements, whether prioritizing response latency or system throughput. ## Acknowledgement The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications. 
--- ## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM) This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content. ### Prerequisites - NVIDIA GB200 or B200 GPUs (example below assumes 8 GPUs; adjust flags for your setup) - Fast SSD storage for model weights - Base model weights available under a directory named `gpt-oss-120b` (example path) - Eagle3 speculative model assets available under a directory named `eagle` Expected directory layout on the host (example): ``` /path/to/models/ ├─ gpt-oss-120b/ # base model directory └─ eagle/ # Eagle3 speculative decoding assets ``` ### Get the TensorRT LLM Container (1.1.0rc0) If required by your environment, log into NGC and pull the image: ```bash # Create an API key at https://ngc.nvidia.com (if you don't have one) docker login nvcr.io # Username: $oauthtoken # Password: docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 ``` ### Start the TensorRT LLM Container Run the container and bind-mount your models directory to `/config/models` inside the container: ```bash docker run --rm --ipc=host -it \ --ulimit stack=67108864 \ --ulimit memlock=-1 \ --gpus all \ -p 8000:8000 \ -v /path/to/models:/config/models:rw \ nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \ /bin/bash ``` Replace `/path/to/models` with the absolute path on your host. ### Download the models (Base + Eagle3) Inside the container, download the base model and the Eagle3 speculative model to the expected directories under `/config/models/`: ```bash # Optional: authenticate if the repository requires it # export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential pip install -q "huggingface_hub[cli]" # Base model: openai/gpt-oss-120b huggingface-cli download openai/gpt-oss-120b \ --local-dir /config/models/gpt-oss-120b \ --repo-type model # Eagle3 model assets mkdir -p /config/models/eagle huggingface-cli download nvidia/gpt-oss-120b-Eagle3 \ --local-dir /config/models/eagle \ --repo-type model ``` References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3` ### Create the Eagle3 Configuration Inside the container, create the YAML file at `/config/models/eagle/eagle.yaml` with the following content: ```bash mkdir -p /config/models/eagle cat > /config/models/eagle/eagle.yaml << 'EOF' trust_remote_code: true kv_cache_config: enable_block_reuse: false free_gpu_memory_fraction: 0.8 speculative_config: decoding_type: Eagle max_draft_len: 3 speculative_model_dir: /config/models/eagle/ cuda_graph_config: max_batch_size: 10 use_torch_sampler: true moe_config: backend: TRTLLM EOF ``` Notes: - Ensure your base model directory is `/config/models/gpt-oss-120b`. - Ensure your Eagle3 assets are present under `/config/models/eagle/`. - If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`. ### Launch the Server (Eagle3 Speculative Decoding) Run the following command inside the container to start the endpoint: ```bash TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --config /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072 ``` The server initializes, loads, and optimizes the models. 
After it is ready, it listens on port 8000. ### Quick Health Check From another terminal on the host, verify that the server is healthy: ```bash curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health" ``` When `Status: 200` is returned, the endpoint is ready to serve requests. ### Sample Chat Completions Request Note: This Eagle3 + TensorRT LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`. Send a simple OpenAI-compatible Chat Completions request to the running server: ```bash curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ -d '{ "model": "gpt-oss-120b", "messages": [ {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."} ], "max_tokens": 128, "stream": false }' ``` --- # Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly *By NVIDIA TensorRT LLM Team and the XGrammar Team* ## Table of Contents - [Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly](#combining-guided-decoding-and-speculative-decoding-making-cpu-and-gpu-cooperate-seamlessly) - [Table of Contents](#table-of-contents) - [Background and Challenges](#background-and-challenges) - [Motivation](#motivation) - [Guided Decoding](#guided-decoding) - [Speculative Decoding](#speculative-decoding) - [Two Challenges](#two-challenges) - [Trace Grammar State for Draft Token Proposal and Rejection](#trace-grammar-state-for-draft-token-proposal-and-rejection) - [Target Model](#target-model) - [Draft Model](#draft-model) - [Make Grammar Computation Capturable by CUDA Graph](#make-grammar-computation-capturable-by-cuda-graph) - [CUDA Callback](#cuda-callback) - [Integration to TensorRT LLM Python Runtime](#integration-to-tensorrt-llm-python-runtime) - [CUDA Graph Compatibility: Grammar Computation](#cuda-graph-compatibility-grammar-computation) - [CUDA Graph Compatibility: Mask Applying Kernel](#cuda-graph-compatibility-mask-applying-kernel) - [Troubleshooting: Data Race between Host and CUDA Callback](#troubleshooting-data-race-between-host-and-cuda-callback) - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#troubleshooting-deadlock-by-gil-and-cuda-mutex) - [Performance and Analysis](#performance-and-analysis) - [Acknowledgements](#acknowledgements) ## Background and Challenges ### Motivation As part of our effort to bridge gaps in feature combinations, we enabled guided decoding with many important LLM inference features in TensorRT LLM over the last two months: * Overlap scheduler: [PR 6000](https://github.com/NVIDIA/TensorRT-LLM/pull/6000) * CUDA graph padding: [PR 6774](https://github.com/NVIDIA/TensorRT-LLM/pull/6774) * Disaggregated serving: [PR 6704](https://github.com/NVIDIA/TensorRT-LLM/pull/6704) * Speculative decoding (two-model implementation): [PR 6300](https://github.com/NVIDIA/TensorRT-LLM/pull/6300) * Speculative decoding (one-model implementation): [PR 6948](https://github.com/NVIDIA/TensorRT-LLM/pull/6948) More complicated (higher-order) combinations are also supported; for example, we can run DeepSeek-R1 with guided decoding, overlap scheduler, CUDA graph, [attention data parallelism (ADP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md), [multiple token prediction 
(MTP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md) and [disaggregated serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)​ all enabled. Among all these tasks, combining guided decoding with one-model speculative decoding is the most challenging one, and it achieves the best performance for low-latency or throughput@latency scenarios. This blog post shares the overall design, implementation details, and performance analysis. ### Guided Decoding Guided decoding (or interchangeably constrained decoding, structured generation) guarantees that the LLM outputs are amenable to a user-specified grammar (e.g., JSON schema), which is particularly useful for LLM agents. For example, guided decoding can help an LLM generate function arguments that strictly conform to function signatures. Thus, the LLM can correctly call external tools and integrate the tool calling results for a better response. For a request at the prefill phase, guided decoding creates an initial grammar state (i.e., grammar compilation), and generates a mask tensor indicating which tokens from the vocabulary are allowed for the first generated token (i.e., mask gen). At each generation phase, guided decoding advances the grammar state based on the last generated token (i.e., grammar advance), and generates a mask tensor for the next token. The mask will be applied to the logits to mask out the disallowed tokens before sampling (i.e., mask applying), which ensures the next token is amenable to the grammar constraints. TensorRT LLM integrates third-party grammar backends (e.g., [XGrammar](https://github.com/mlc-ai/xgrammar), [LLGuidance](https://github.com/guidance-ai/llguidance)) for the grammar computation. Currently, these grammar backends are implemented on CPU, so the grammar computation introduces significant CPU overhead. Fortunately, this can be overlapped with the GPU computation, achieving [near-zero overhead](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). The core idea is that at every iteration, we should first launch the model forward to make the GPU busy, and then compute grammar compilation/advance and mask gen on CPU. Once both the computations finish, the mask can be applied to the logits before sampling.

Figure 1: Top: guided decoding timeline without overlapping. Bottom: guided decoding timeline with overlapping. (This figure is from the XGrammar paper.)

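To make the mask gen and mask applying steps above concrete, here is a minimal PyTorch sketch of a single guided decoding step. The `grammar` object (with `fill_next_token_mask` and `advance` methods) is a hypothetical stand-in for a CPU grammar backend such as XGrammar, and `model` is any decoder-only LM that returns logits; this illustrates the pattern rather than TensorRT LLM's actual implementation.

```python
import torch

def guided_decoding_step(model, input_ids, grammar, vocab_size):
    # 1. Launch the model forward first so the GPU stays busy (asynchronous GPU work).
    logits = model(input_ids)[:, -1, :]                      # (batch=1, vocab_size)

    # 2. Meanwhile, do the CPU grammar work: mask gen into a pinned host buffer.
    mask_cpu = torch.zeros(vocab_size, dtype=torch.bool, pin_memory=True)
    grammar.fill_next_token_mask(mask_cpu)                   # True = allowed by the grammar
    mask_gpu = mask_cpu.to("cuda", non_blocking=True)        # async H2D copy

    # 3. Mask applying: disallowed tokens get -inf logits, then sample (greedy here).
    logits = logits.masked_fill(~mask_gpu, float("-inf"))
    next_token = torch.argmax(logits, dim=-1)

    # 4. Grammar advance with the newly sampled token (forces a device sync here).
    grammar.advance(next_token.item())
    return next_token
```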
### Speculative Decoding

Speculative decoding is a crucial feature in low-latency or throughput@latency LLM inference scenarios. For each request, a lightweight drafter proposes several draft tokens, and then the target model verifies the draft tokens in parallel. Ideally, most draft tokens are accepted, so multiple tokens are generated in a single target model forward. Compared with normal LLM inference, where each model forward generates a single token, speculative decoding offers the potential to generate more tokens per iteration by leveraging more computation. This improves the arithmetic intensity and reduces the required number of iterations.

TensorRT LLM has two kinds of speculative decoding implementations, namely the one-model and two-model implementations. The one-model implementation launches a single CUDA graph for a target model forward together with multiple draft model forwards. This is more difficult to implement and is coupled with the modeling code, but it offers the best performance. The two-model implementation decouples the target and draft models into separate CUDA graphs, which is much more flexible and offers better feature coverage. There are ongoing efforts to close the gap between the two implementations.

Figure 2: Top: GPU timeline of one-model speculative decoding. Bottom: GPU timeline of two-model speculative decoding.

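As a refresher on the verification step, here is a minimal greedy-verification sketch. It assumes greedy sampling and a simple chain of draft tokens; the function and tensor names are illustrative and this is not the TensorRT LLM implementation.

```python
import torch

def greedy_verify(target_logits, draft_tokens):
    """target_logits: (num_draft + 1, vocab) logits from one target forward over
    [last accepted token] + draft_tokens; draft_tokens: (num_draft,) proposed IDs."""
    target_choices = target_logits.argmax(dim=-1)   # target's preferred token at each position
    accepted = []
    for i, draft in enumerate(draft_tokens.tolist()):
        if draft == target_choices[i].item():
            accepted.append(draft)                  # draft agrees with the target: accept it
        else:
            break                                   # first mismatch: reject the rest
    # The "bonus" token comes from the target logits right after the last accepted token.
    bonus = target_choices[len(accepted)].item()
    return accepted, bonus
```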
### Two Challenges

When combining guided decoding and speculative decoding, two challenges arise.

First, at each generation iteration, speculative decoding proposes multiple draft tokens, some of which might be rejected in the verification step. The draft token proposal and rejection are not transparent to guided decoding. Specifically, this can be broken down into two views:

* For the target model, guided decoding should advance the grammar state and generate the mask for every draft token. If some draft tokens are rejected, guided decoding should roll back the grammar state to the last accepted token.
* For the draft model, without grammar constraints, some draft tokens may violate the grammar and thus be forcefully rejected in the verification step. Clearly, this hurts the acceptance rate. Hence, guided decoding should also intervene on the logits for every draft token generation if possible.
  * Some speculative algorithms propose draft tokens recurrently by computing logits and sampling (e.g., the standard draft-target model, EAGLE or MTP), similarly to a standard LLM. In that case, guided decoding can apply grammar constraints in a similar mask gen and mask applying manner.
  * Some drafting algorithms work without logits sampling, which requires other ways to apply the grammar constraints.

Second, specific to the one-model speculative decoding where a single CUDA graph contains multiple (draft and target) model forwards, the CPU-GPU synchronization becomes challenging. Note that for every step $i$, there are two event waits:

* The host waits for the *token event* that indicates the readiness of CPU tokens from step $i-1$.
* The model forward stream waits for the *mask event* that indicates the readiness of GPU masks from step $i$.

Figure 3: The CPU-GPU synchronization for multiple model forwards.

Note that in the two-model implementation, sampling is excluded from the CUDA graphs for better flexibility (Figure 2). From the CPU perspective, this leaves a window for the grammar computation. In particular, the mask event wait can be inserted between the CUDA graph replay and sampling, effectively making the GPU wait for the GPU masks asynchronously copied from the CPU. However, the CUDA graph of the one-model implementation contains multiple forwards, inevitably including the sampling operations. Hence, there is no such window for the grammar computation. The most prominent problem is that when replaying the CUDA graph, the mask event wait cannot be inserted before sampling. An alternative is capturing the events and waits in the CUDA graph, but this is still ineffective because the grammar computation runs on CPU and is thus not capturable. Once such a CUDA graph is launched for replay, the GPU does not wait for any newly recorded events, so it is impossible to block the GPU on the readiness of the masks.

## Trace Grammar State for Draft Token Proposal and Rejection

### Target Model

For a target model forward, a request should have one new token and multiple draft tokens, from the last verification step and the drafter, respectively. For each token in the sequence, guided decoding should advance the grammar state and fill the mask tensor. Before sampling, the masks should be applied to the corresponding logits. After verification, the grammar state should be rolled back by the number of rejected tokens. Compared to guided decoding with non-speculative decoding, the rollback operation is newly introduced. Thankfully, it is natively supported by grammar backends like [XGrammar](https://github.com/mlc-ai/xgrammar/blob/v0.1.21/python/xgrammar/matcher.py#L341-L350) and [LLGuidance](https://github.com/guidance-ai/llguidance/blob/v1.1.1/python/llguidance/_lib.pyi#L363-L366).

Before proceeding to the draft model view, note that the LLM can generate correct outputs as long as we apply grammar constraints on the target model, because any draft tokens violating the grammar will be forcefully rejected by the verification step. However, this hurts the acceptance rate.

### Draft Model

As mentioned above, for speculative algorithms based on recurrent logits sampling, we can apply grammar constraints to draft tokens in a similar mask gen and mask applying manner. Specifically, for the first drafting step, guided decoding advances the grammar state using the last new token. For the following drafting steps, the grammar state is advanced using the last draft token. Each step should fill and apply the mask to the corresponding draft model logits before sampling. After the drafting process, the grammar state should be rolled back to the original state, so that the subsequent target model forward starts from the correct grammar state.

If the draft and target models share the same vocabulary, the grammar computation is exactly the same, so the masks can be reused. One special case is EAGLE3, whose draft model has a [pruned vocabulary](https://github.com/SafeAILab/EAGLE/blob/58d1de099fe315645a82fe002e46586d54efe405/eagle/traineagle3/config.json#L22-L23) compared to the target model. For instance, LLaMA 3.1 has a 128k vocabulary, while the corresponding EAGLE3 drafter has a vocabulary containing the most frequent 32k tokens. This saves some computation in the lm_head GEMM. Note that the grammar is built on the target model's vocabulary, so the produced mask cannot be directly applied to the logits of the draft model.
EAGLE3 provides a special [d2t](https://github.com/SafeAILab/EAGLE/blob/d7161f9f94aaa345654d9b4045931145811d4d03/eagle/traineagle3/cnets.py#L673-L681) tensor that maps draft token IDs to target token IDs. [PR 7481](https://github.com/NVIDIA/TensorRT-LLM/pull/7481) fuses this d2t mapping into the mask applying kernel.

> **Note:** Here we focus on chain-based speculative algorithms. A tree-based algorithm would further complicate the implementation; in particular, guided decoding would have to traverse the drafting tree and advance and roll back grammar states accordingly.

## Make Grammar Computation Capturable by CUDA Graph

### CUDA Callback

CUDA graph helps eliminate CPU overhead and is an important technique in LLM inference systems, especially for the generation phase. As mentioned above, the one-model speculative decoding implementation launches a single CUDA graph to compute multiple draft and target model forwards. This makes the CPU-GPU synchronization challenging: the sampling operation depends on masks computed on CPU, but the GPU cannot wait for the readiness of any CPU computation once the CUDA graph is launched.

The CUDA callback [`cudaLaunchHostFunc`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g05841eaa5f90f27124241baafb3e856f) can launch a host function onto a CUDA stream. (The host function must not call any CUDA API.) This has two crucial implications:

* CUDA events and event waits can be inserted before and after the host functions, which can be used to synchronize the CPU and GPU computation.
* The host functions can be captured and replayed by CUDA graph.

Hence, we can launch the grammar computation, along with other auxiliary host functions, as CUDA callbacks onto a CUDA stream. The CUDA graph should capture and replay the multiple model forwards and the corresponding grammar computation all together. To achieve CPU-GPU overlapping, the grammar computation should be placed on a dedicated CUDA stream. Specifically, for every step $i$:

* The grammar stream:
  * waits for the *token event* that indicates the readiness of CPU tokens from step $i-1$;
  * performs grammar advance and mask gen (CUDA callback);
  * asynchronously copies the CPU masks to GPU;
  * records the *mask event*.
* The model forward stream:
  * computes the model forward using the last GPU tokens;
  * waits for the *mask event* that indicates the readiness of GPU masks;
  * applies the mask to the logits and then samples new tokens;
  * asynchronously copies the GPU tokens to CPU;
  * records the *token event*.

Figure 4: The CPU-GPU synchronization for multiple model forwards by CUDA callback.

### Integration to TensorRT LLM Python Runtime

We surveyed some off-the-shelf Python bindings of `cudaLaunchHostFunc`, but it turned out that they do not work well with CUDA graph (e.g., CUDA-Python [Issue 790](https://github.com/NVIDIA/cuda-python/issues/790), cupy [Issue 9274](https://github.com/cupy/cupy/issues/9274)). The probable reason is that the intermediate wrapper data structures are released once the callback is executed; hence, even though the callback is captured by CUDA graph, it cannot be replayed multiple times.

We implemented our own bindings to `cudaLaunchHostFunc` — [`launch_hostfunc`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L76). Specifically, `launch_hostfunc` packs the Python function and arguments into an [intermediate data structure](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L33) and calls `cudaLaunchHostFunc` to launch a [trampoline function](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L49) onto a CUDA stream. The trampoline function unpacks the intermediate data structure and invokes the Python function with the arguments. Note that `launch_hostfunc` offers great flexibility — it can launch an arbitrary Python function (without any CUDA API calls) as a CUDA callback. Hence, the grammar computation logic can still be implemented in Python.

When CUDA graph is capturing, `launch_hostfunc` does not release the intermediate data structure, so it remains accessible during CUDA graph replay. The intermediate data structures can be manually released via [`free_hostfunc_user_data`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L97); otherwise, they are automatically cleaned up when the Python interpreter exits. If CUDA graph is disabled (e.g., in the prefill phase), the intermediate data structure should be released promptly to avoid a memory leak. Specifically, the trampoline function automatically releases it once the callback finishes execution.

In Python, we provide a decorator `hostfunc` which casts an arbitrary Python function to a CUDA callback. For example, run the below code snippet:

```python
import torch

from tensorrt_llm._torch.hostfunc import hostfunc


@hostfunc
def increase(x: torch.Tensor):
    x.add_(1)


x = torch.zeros(10, dtype=torch.int32)
stream = torch.cuda.Stream()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=stream):
    increase(x)
    increase(x)

torch.cuda.synchronize()
with torch.cuda.stream(stream):
    for _ in range(10):
        g.replay()
torch.cuda.synchronize()

print(x)
```

The output would look like:

```txt
tensor([20, 20, 20, 20, 20, 20, 20, 20, 20, 20], dtype=torch.int32)
```

Note that the CUDA graph increments the tensor twice, and the graph is replayed ten times, so the tensor should be incremented by 20 in total. The output validates that the CUDA graph capture and replay are successful.

As the final step, we implemented a variant of `GuidedDecoder` — [`CapturableGuidedDecoder`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L405). It reuses most of the logic from `GuidedDecoder`, but the grammar computation and some auxiliary methods are decorated with `hostfunc`, making them capturable by CUDA graph.

### CUDA Graph Compatibility: Grammar Computation

Once captured, a CUDA graph can be launched to run the same GPU kernels as many times as needed.
Note that the replayed kernels are always executed on fixed input and output memory addresses; by filling the input buffers with new data, we can run the same work on new data. This pattern also applies to CUDA callbacks, except that the input and output buffers are on CPU. The guided decoder manages the buffers and resources below:

* [Request states](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L20): All the necessary request information affecting grammar computation, including the user-specified grammar, the last new token, and the draft tokens.
* [Grammar states](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L167-L168): The grammar states managed by the grammar backends. By leveraging the grammar backends, the guided decoder advances grammar states and fills mask tensors.
* [New tokens tensor](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L419-L422): The tensor values are copied from the newly computed GPU tokens, and are used to update the last new token or draft tokens of the request states.
* [Mask tensor](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L175-L177): The tensor values are filled according to the grammar states and then copied to the GPU masks, which are applied to the logits.

The buffers are stored at fixed memory locations, and the resources are accessed via fixed pointers. This makes the grammar computation compatible with CUDA graph. The buffers and resources are connected via slot IDs. In the runtime, each request is assigned an exclusive slot ID (0 <= slot ID < `max_batch_size`) upon its first scheduling. The slot ID stays occupied until the request is finished and removed from the scheduler.

When the runtime schedules a new batch of requests, the guided decoder updates the request states on the host. After that, all the other operations (grammar compilation/advance, mask gen, buffer copying, etc.) happen on CUDA streams and should be capturable by CUDA graph. More specifically, buffer copying should be asynchronous, and the other CPU computation should run as CUDA callbacks.

### CUDA Graph Compatibility: Mask Applying Kernel

The mask applying kernel takes a batch of logits and masks as input and modifies the logits in place. Specifically, the masked-out (disallowed by grammar) token logits are assigned negative infinity, so they can never be sampled as the next tokens.

Note that currently CUDA graph is enabled for the generation phase only, and the draft length is fixed for all requests. This greatly simplifies the effort for CUDA graph compatibility. Given `batch_size` and `max_num_draft_tokens`, the logits tensor has shape `(batch_size * (1 + max_num_draft_tokens), vocab_size)`. Hence, we can fill the first `batch_size * (1 + max_num_draft_tokens)` rows of the mask tensor accordingly and pass the mask tensor address to the kernel.

Some requests may have no grammar constraints. For such requests, we could fill the corresponding masks with all ones (allowed by grammar) so the logits are not modified by the kernel, but this causes unnecessary computation. To resolve this, a token-level mask tensor is introduced. Its values are set to zero for requests without grammar constraints, and the kernel skips the rows whose token-level mask value is zero.
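For illustration, the semantics of the mask applying kernel can be sketched in a few lines of PyTorch; the actual implementation is a fused, in-place CUDA kernel, and the tensor names here are illustrative.

```python
import torch

def apply_masks(logits, vocab_masks, token_level_mask):
    """PyTorch sketch of the mask applying semantics (not the real CUDA kernel).

    logits:           (batch_size * (1 + max_num_draft_tokens), vocab_size), modified in place
    vocab_masks:      same shape, bool, True = token allowed by the grammar
    token_level_mask: (batch_size * (1 + max_num_draft_tokens),), 0 = row has no grammar constraint
    """
    constrained = token_level_mask.bool()                        # rows that actually need masking
    logits[constrained] = logits[constrained].masked_fill(
        ~vocab_masks[constrained], float("-inf"))                # disallowed tokens can never be sampled
    return logits
```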
### Troubleshooting: Data Race between Host and CUDA Callback

Similar to GPU kernels, CUDA callbacks are executed asynchronously on CUDA streams. Note that both normal host functions and CUDA callbacks can access the same CPU memory addresses, which can easily cause a data race. In the initial implementation, `CapturableGuidedDecoder` directly read request states from [`ScheduledRequests`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/scheduler.py#L18). However, `ScheduledRequests` is shared throughout an executor iteration and is thus probably modified by other executor components. This creates a potential data race scenario:

* The guided decoder launches a CUDA callback, which will read some request states from `ScheduledRequests`;
* Some other executor component modifies `ScheduledRequests` in place;
* The CUDA callback is executed, reading the modified request states from `ScheduledRequests`.

Clearly, the CUDA callback may read unexpected data. This data race motivates a dedicated request states class — [`GuidedRequest`](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L20). It is a request snapshot created for the guided decoder only, so it is never modified by other components. It is also possible that the guided decoder itself accesses request states via both normal host functions and CUDA callbacks, so we adopt a protocol that the request snapshots are created on the host and then accessed only via CUDA callbacks. This prevents potential data races within an executor iteration.

When the overlap scheduler is enabled, another data race scenario exists between executor iterations:

* Iteration $i$ launches CUDA callbacks, which will read request states from a fixed address;
* Iteration $i+1$ updates the request states;
* Iteration $i$'s CUDA callbacks are executed, reading request states updated by iteration $i+1$.

Again, the CUDA callbacks may read unexpected data. A straightforward solution is to let the request state update wait for the CUDA callback execution, but this effectively disables overlap scheduling. To resolve this issue and also unblock overlap scheduling, a [queue](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/tensorrt_llm/_torch/pyexecutor/guided_decoder.py#L417) is introduced. For each iteration, a new batch of request states is put into the queue; then, a CUDA callback is launched to fetch a new batch of request states from the queue, and all the subsequent CUDA callbacks access the newly fetched request states. This allows request snapshots of two (or even more) iterations to co-exist, which prevents potential data races between iterations.

### Troubleshooting: Deadlock by GIL and CUDA Mutex

After the first version was implemented, the program intermittently hung when `CapturableGuidedDecoder` was enabled. By inspecting the call stack, we found that it was hanging on seemingly unrelated kernel launches or other CUDA API calls. With further investigation, we discovered that the hang was caused by a deadlock between the Python GIL and a CUDA mutex. As documented, a CUDA callback must not make any CUDA API calls; this implies that CUDA callback execution and CUDA API calls compete for the same mutex. Note that the trampoline function needs to [acquire the GIL](https://github.com/NVIDIA/TensorRT-LLM/blob/v1.1.0rc5/cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp#L52) before calling the Python code.
Hence, when a CUDA callback executes Python code, it acquires the CUDA mutex and then the GIL. Meanwhile, the Python main thread may hold the GIL and make CUDA API calls, so it acquires the GIL and then the CUDA mutex. The two threads acquire the two locks in opposite orders, which creates a deadlock pattern.

This deadlock can be resolved if the Python main thread releases the GIL for CUDA API calls. The TensorRT LLM Python runtime is built on PyTorch. Thankfully, PyTorch releases the GIL for most CUDA API calls, including PyTorch custom operators. However, we found two exceptions in PyTorch 2.8. First, when creating a device tensor using a shape that depends on data in another device tensor, PyTorch triggers an implicit, synchronized D2H copy, and this copy is executed with the GIL held ([Issue 163062](https://github.com/pytorch/pytorch/issues/163062)). This can be reproduced by the below code snippet:

```python
import torch

x = torch.randint(0, 100, (100,), dtype=torch.int64, device='cuda')
y = torch.zeros(100, x.max(), dtype=torch.int64, device='cuda')
```

The other case is that `torch.compile` kernels are called with the GIL held ([Issue 163061](https://github.com/pytorch/pytorch/issues/163061)), although Triton kernels are called with the GIL released. Hence, we have to avoid the problematic operators and disable `torch.compile` when using CUDA callbacks into Python code ([PR 7871](https://github.com/NVIDIA/TensorRT-LLM/pull/7871)), until these issues are fixed in PyTorch.

Another source of risk comes from runtime components that are implemented in C++ and exposed as Python bindings; they may make CUDA API calls as well. By default, Python bindings do not release the GIL. Hence, we swept these Python bindings and released the GIL properly ([PR 6948](https://github.com/NVIDIA/TensorRT-LLM/pull/6948)). After all these efforts, the hang issue disappeared. It is generally recommended to release the GIL when calling C++ code from Python; even outside the context of CUDA callbacks, this is beneficial for multi-threading performance. However, we acknowledge the limitation that it is difficult to make sure that every such place has been properly handled, and that future code changes do not introduce new risks.

> **Note:** Theoretically, GIL-free Python ([PEP 703](https://peps.python.org/pep-0703)) could be another remedy.

## Performance and Analysis

We benchmark the performance of guided decoding on two datasets, [JSON Mode Eval](https://huggingface.co/datasets/NousResearch/json-mode-eval) and [JSON Schema Bench](https://huggingface.co/datasets/epfl-dlab/JSONSchemaBench). The models are [LLaMA 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [LLaMA 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), the GPUs are H200, and the grammar backend is XGrammar.

Figure 5: Pareto curve on LLaMA 3.1 8B TP1 on H200, JSON Mode Eval. The concurrency ranges from 1 to 128.

Figure 6: Pareto curve on LLaMA 3.3 70B TP4 on H200, JSON Mode Eval. The concurrency ranges from 1 to 128.

Figures 5 and 6 present the Pareto curves on JSON Mode Eval for LLaMA 3.1 8B and LLaMA 3.3 70B, respectively. Speculative decoding achieves significant speedup for low-latency or throughput@latency scenarios. In particular, the speedup can be up to ~2x for batch size 1. The one-model EAGLE3 implementation is more performant than the two-model EAGLE3, and this performance gap is amplified for small models. This is reasonable, because the one-model implementation captures more workloads into a single CUDA graph, which results in less (if any) exposed CPU overhead. Note that although NGram is a two-model implementation, it performs surprisingly well. This is because JSON Mode Eval is an information extraction task. Each prompt contains the JSON schema and all the information required by the response, so the NGram has a high acceptance rate on this dataset.

Figure 7: Pareto curve on LLaMA 3.1 8B TP1 on H200, JSON Schema Bench. The concurrency ranges from 1 to 128.

Figure 8: Pareto curve on LLaMA 3.3 70B TP4 on H200, JSON Schema Bench. The concurrency ranges from 1 to 128.

Figures 7 and 8 show the results on JSON Schema Bench. The one-model EAGLE3 achieves the best performance across almost all scenarios. Note that the NGram becomes less performant since the task is no longer an information extraction task, although the JSON schemas are still present in the prompts.

| Dataset | Model | EAGLE3 | EAGLE3 w/o draft | NGram |
| :-----: | :---: | :----: | :--------------: | :---: |
| JSON Mode Eval | LLaMA 3.1 8B | 2.86 | 2.65 | 2.59 |
| JSON Mode Eval | LLaMA 3.3 70B | 2.72 | 2.60 | 2.44 |
| JSON Schema Bench | LLaMA 3.1 8B | 2.55 | 2.33 | 1.89 |
| JSON Schema Bench | LLaMA 3.3 70B | 2.50 | 2.30 | 1.87 |

Table 1: Average acceptance lengths per iteration for EAGLE3 and NGram. The acceptance length includes the golden token. The draft length is 3. "EAGLE3 w/o draft" means the draft model does not apply grammar constraints.

Table 1 lists the average acceptance lengths per iteration. We performed an ablation experiment in which the draft model does not apply grammar constraints. As presented, this does decrease acceptance rates, but by a smaller margin than expected. Note that applying grammar constraints on the draft model introduces extra overheads:

* In the drafting loop, the extra mask applying kernels slightly contribute to the GPU time.
* If the drafting forwards are too fast to hide the grammar computation, the exposed CPU time causes bubbles in the GPU timeline.

These extra overheads can partially offset the benefits from the improved acceptance.

## Acknowledgements

This work demonstrates an outstanding example of cross-team collaboration between the TensorRT LLM and XGrammar teams. We sincerely appreciate the support from everyone who contributed to making this happen. We acknowledge that it is built on top of the tremendous existing foundations from the community. In particular, some designs were inspired by vLLM [PR 14702](https://github.com/vllm-project/vllm/pull/14702) and SGLang [PR 6499](https://github.com/sgl-project/sglang/pull/6499). In addition, special thanks go to the authors who proposed speculative algorithms like EAGLE/MTP, and to the grammar backend projects like XGrammar/LLGuidance.

---

# Inference Time Compute Implementation in TensorRT LLM

By NVIDIA TensorRT LLM Team and UCSD Hao AI Lab

## Table of Contents

- [Inference-Time Compute Implementation in TensorRT LLM (Part 1: Design and Implementation)](#inference-time-compute-implementation-in-tensorrt-llm)
  - [Table of Contents](#table-of-content)
  - [Background and Motivation](#background-and-motivation)
  - [Introduction for Scaffolding: A Framework for inference-time compute](#introduction-for-scaffolding)
    - [Core Features](#scaffolding-core-feature)
    - [Architecture](#scaffolding-architecture)
      - [Worker](#scaffolding-architecture-worker)
      - [Controller](#scaffolding-architecture-controller)
      - [ScaffoldingLlm](#scaffolding-architecture-scaffoldingllm)
  - [An Example: Implement Dynasor on Scaffolding](#example-for-scaffolding)
    - [Introduction for Dynasor](#dynasor-introduction)
    - [Implement Dynasor-CoT in Scaffolding](#dynasor-cot-implement-in-scaffolding)
    - [Implement Dynasor-CoT based Majority Voting in Scaffolding](#dynasor-cot-based-majority-vote-in-scaffolding)
    - [Acknowledgements](#dynasor-acknowledgements)
    - [Reference](#dynasor-reference)
  - [Feature List on Scaffolding](#scaffolding-feature-list)
  - [Future Work](#scaffolding-future-work)

## Background and Motivation

Inference-time compute, also known as test-time scaling, is increasingly important. Beyond simply increasing output length, workflows such as best-of-N and Monte Carlo Tree Search (MCTS) offer additional capabilities for optimizing inference. Further, most agentic or multi-agent workflows are logically similar to these inference-time compute methods, except that they use more complex tools and context engineering.

However, conveniently defining these methods while achieving excellent inference performance has become a new problem: good performance requires careful asynchronous scheduling, but writing asynchronous scheduling programs is not easy for algorithm engineers. When external tools and token budget management come into play, the problem becomes even more complex.
LLM inference frameworks such as TensorRT LLM, vLLM, and SGLang provide high performance for inference of generation models or reward models, but they only handle single-request inference. Popular agent frameworks such as LangChain and Dify focus on enabling users to develop agents as simply as possible, but precisely because of this, they may have difficulty implementing inference-time compute methods that require precise definition and development. We therefore want to build a framework that supports users in exploring and deploying more inference-time compute methods. It should provide a modular infrastructure and fill the gap in balancing usability and performance for inference-time compute.

## Introduction for Scaffolding: A Framework for inference-time compute

`Scaffolding` is a high-performance framework for inference-time compute. It makes it easy for users to integrate various methods (CoT, majority vote, best of N, MCTS) and execution backends (TRTLLM / OpenAI API / tools), and also allows users to develop customized features such as token budgets.

### Core Features

The core features include:

* **Decoupling the inference-time compute method from the execution backend.** Scaffolding provides the `Controller` concept for users to define the method, the `Worker` concept to develop the execution backend, and `ScaffoldingLlm` to provide the API for users to integrate `Controller` and `Worker` and run the request.
* **Making inference-time compute methods modular and reusable.** An inference-time compute method can be composed of multiple modules. In Scaffolding, a `Controller` can be constructed from a series of `Sub-Controllers`, so users can flexibly assemble and replace the `Sub-Controllers`.
* **Providing sufficient concurrency to achieve good performance with ease of use.** Concurrency is the key to performance. `Scaffolding` provides three levels of concurrency: first, different requests to a `ScaffoldingLlm` instance can run concurrently; second, multiple `Sub-Controllers` can run concurrently; third, the multiple Tasks yielded from a `Controller` can run concurrently.

### Architecture

`Scaffolding` consists of three core components. Let's first briefly introduce these components. The `Worker` class is the backend that executes a single task, such as sending an inference request to an LLM inference framework or service, or completing a call to an external tool. The `Controller` class focuses on defining the workflow of an inference-time compute method. The `ScaffoldingLlm` class is responsible for integrating the two and completing the entire task.

This is the call sequence diagram of `Scaffolding`:

Figure 1. Scaffolding Sequence

Here we can focus on two points. First, `ScaffoldingLlm` provides users with the interface. Second, the `Controller` does not directly call the `Worker`. Next, we will introduce the code of the core components.

#### Worker

```python
class Worker(ABC):

    async def run_task(self, task: Task) -> TaskStatus:
        worker_cls = type(self)
        if type(task) not in worker_cls.task_handlers:
            return TaskStatus.WORKER_NOT_SUPPORTED
        return await worker_cls.task_handlers[type(task)](self, task)

    task_handlers = {}
```

The core interface of `Worker` is `run_task()`, which accepts a `Task`, executes it, and writes the result to the appropriate field. It should be noted that `run_task()` is an asynchronous function, so it can be called concurrently and asynchronously with Python asyncio.

#### Controller

```python
class Controller(ABC):

    def __init__(self):
        self.task_collections = {}

    def clone(self):
        return copy.deepcopy(self)

    def generate(self, prompt: str, **kwargs) -> GenerationResult:
        task = GenerationTask.create_from_prompt(prompt)
        yield from self.process([task], **kwargs)
        return task.create_scaffolding_output()

    def process(self, tasks: List[Task], **kwargs):
        raise NotImplementedError
```

Its two core interfaces are `generate()` and `process()`. `generate()` is the entry point for `ScaffoldingLlm` to invoke. In the default implementation of `generate()`, it produces a `Task` and then invokes `process()`. `process()` is the most important part of every `Controller` class, as it defines the workflow of the inference-time compute method. Let's go into a specific subclass of `Controller` to see how `process()` is implemented.

```python
class NativeGenerationController(Controller):

    class WorkerTag(Enum):
        GENERATION = "generation"

    def process(self, tasks: List[Task], **kwargs):
        for task in tasks:
            task.worker_tag = self.WorkerTag.GENERATION
            for key, value in self.sampling_params.items():
                if getattr(task, key) is None:
                    setattr(task, key, value)
            task.streaming = self.streaming

        yield tasks
```

Essentially, `process()` is a Python generator that yields lists of tasks. When the generator is resumed, that is, when the yield statement returns, the `Tasks` have been completed, meaning the result of each `Task` has been written into its result field. Then `process()` can proceed to the next steps. From here we can see that the implementation of a `Controller` can focus on the design of the workflow. It does not directly call the `Worker` and does not need to care about how these tasks are completed. That is how `Scaffolding` decouples the inference-time compute method from the execution backend.

Also, `Controller` makes the inference-time compute method modular and reusable. It only requires the `sub-Controller` to be a member of the class; the `process()` function of the `sub-Controller` is then called using the `yield from` statement.

```python
yield from self.reward_controller.process(generation_tasks, **reward_kwargs)
```

For concurrency with ease of use, `Controller` provides two mechanisms. As the code above shows, the yield statement yields a list of `Task`s, so the first mechanism is that the multiple Tasks in a single yield statement are executed in parallel. The second mechanism is for multiple `sub-Controllers` that can be executed in parallel; `Controller` provides syntactic sugar called `ParallelProcess`.
```python
generation_controllers = [
    self.generation_controller for _ in range(sample_num)
]
generation_kwargs_list = [generation_kwargs for _ in range(sample_num)]
generation_tasks = [copy.deepcopy(task) for _ in range(sample_num)]

yield ParallelProcess(generation_controllers,
                      [[t] for t in generation_tasks],
                      generation_kwargs_list)
```

#### ScaffoldingLlm

With `Controller` and `Worker` in place, we still need something to combine them: the `ScaffoldingLlm` class.

```python
llm_worker = TRTLLMWorker.init_with_new_llm(
    args.model_dir,
    backend="pytorch",
    max_batch_size=32,
    max_num_tokens=4096,
)
prototype_controller = NativeGenerationController(sampling_params={
    "temperature": 0.9,
    "max_tokens": 1024,
})
llm = ScaffoldingLlm(
    prototype_controller,
    {NativeGenerationController.WorkerTag.GENERATION: llm_worker},
)
results = llm.generate(prompts)
```

Users first create instances of `Worker` and `Controller` and map them by `WorkerTag` to create a `ScaffoldingLlm` instance, then call the `generate` interface of `ScaffoldingLlm` to get the final result. `ScaffoldingLlm` also provides an async interface:

```python
async for result in llm.generate_async(prompt):
    print(">>>", result.outputs[0].text)
```

Therefore, an instance of `ScaffoldingLlm` supports concurrent execution of multiple requests.

Let's summarize the overall implementation of `Scaffolding`. If users want to implement a new inference-time compute method, they can develop a new `Controller`; they can also call existing `Controllers` as its `sub-Controllers`. If users want to implement a new backend, they can either create a new `Worker` or add a new `Task` handler to an existing `Worker`. As for `ScaffoldingLlm`, we have hidden many complex implementation details, such as async scheduling, inside `ScaffoldingLlm`, so users do not need to modify its code.

## An Example: Implement Dynasor-CoT on Scaffolding

Dynasor-CoT is a certainty-based, training-free approach to accelerate Chain-of-Thought (CoT) inference. This section discusses how inference-time compute methods can be smoothly integrated into the TRT-LLM Scaffolding framework, using Dynasor-CoT as an example.

Figure 2. Demo of DeepSeek-R1-Distill-Qwen-7B achieving a 5.74x speedup compared to the baseline when using Dynasor-CoT on MATH500

### Introduction for Dynasor-CoT

#### Motivation of Dynasor-CoT

LLM reasoning is highly token-inefficient, often requiring far more tokens to achieve the same accuracy as non-reasoning models. A major source of this inefficiency is that reasoning models tend to **self-doubt**; they often reach the correct answer early but then engage in extended verification behaviors like double-checking and reassessment. For instance, Figure 2 compares a traditional Qwen-7B model with a reasoning-focused, Deepseek-distilled Qwen-7B model on a simple question. While the traditional model reaches its answer in 180 tokens, the reasoning model expends 1,000 tokens on iterative verification, despite having already found the correct answer at token 340. This represents a significant waste of tokens for diminishing returns on accuracy.

Figure 2. An example answer from a reasoning model (Deepseek-distilled Qwen-2.5 7B) vs. a traditional model (Qwen-2.5 7B) on one of the problems in the MATH500 dataset.

#### The "Probe" technique

Dynasor-CoT uses a **"Probe-In-The-Middle"** (or "probe" for short) technique, which prompts reasoning models to output early-stage results during intermediate steps of reasoning. Imagine you're in a math exam working on a hard problem: when time is up, you're forced to write down your final answer, regardless of how confident you are. A probe works the same way. More specifically, a probe is an extra generation request with an eliciting prompt appended to the intermediate reasoning tokens. One effective eliciting prompt is: `Oh, I suddenly got the answer to the whole problem, Final Answer: boxed{`.

Figure 3 shows an analysis comparing the accuracy of directly asking versus probing the model. Taking AMC23 as an example, reasoning models frequently arrive at correct answers early (median: 830 tokens) but continue generating unnecessary tokens due to self-doubt (median: 2.7K tokens).

Figure 3. DeepSeek-R1's performance on AMC23 and AIME24 at varying token budgets. (Left) Standard reasoning with late answer outputs. (Right) Early answer extraction using the Probe-In-The-Middle technique, demonstrating equivalent accuracy with a 50% token reduction. The greener regions in the right panels suggest the model knows the answers much earlier than it reveals in standard reasoning.

#### How it speeds up inference

Instead of generating a fixed number of tokens or waiting for a stop token, Dynasor-CoT **probes the model regularly** (e.g., every 32, 64, or 128 tokens) and **terminates the process** early once a consistent answer is formed across recent probes. This avoids unnecessary computation, directly reducing latency. Figure 4 provides an illustration:

* **Case 1**: All three probe requests yield the same answer, "3159.", indicating high certainty. The process can exit early.
* **Case 2**: Early-stage answers are inconsistent, indicating low confidence, so generation continues.
* **Case 3**: The model generates special tokens such as "wait" or "hmm," signaling hesitation; generation continues.

Figure 4. Illustration of Dynasor-CoT. Case 1: early exit due to consistent early-stage results. Case 2: continue generation due to inconsistent early-stage results. Case 3: responses containing hesitation words (e.g., wait) are discarded.

### Implement Dynasor-CoT in Scaffolding

A key difference between inference-time compute methods like Dynasor-CoT and a normal LLM generation request is that the generation process can consist of multiple smaller, user-defined tasks. The results of these tasks can dynamically control the overall logic—for example, by determining whether to expand the scope of subsequent generation or to terminate the process entirely. In a single Dynasor-CoT request, generation proceeds chunk by chunk, with additional "probe" tasks running in parallel with the main generation. Once a consistent answer is formed across recent probes, the process terminates early.

`Scaffolding` provides a good solution for customizing these kinds of data flows. Within a `Controller`, we can customize the data flow logic by defining how and when these smaller tasks are submitted. To implement Dynasor-CoT, we simply inherit from the base `Controller` class and override the `process()` function to customize how it yields tasks. We don't need to worry about how these tasks are executed because the inference-time compute methods and the execution backend are modularized and decoupled in Scaffolding. These tasks are submitted to `ScaffoldingLlm`, which then dispatches workers to complete them.

Let's start the implementation by inheriting the `Controller` class and adding the necessary parameters for Dynasor-CoT.

```python
class DynasorGenerationController(Controller):

    class WorkerTag(Enum):
        GENERATION = "generation_with_dynasor_cot"

    def __init__(
        self,
        generation_dir,
        max_tokens=8192,
        certainty_threshold=3,
        chunk_size=64,
        streaming=False,
    ):
        super().__init__()
        self.generation_dir = generation_dir
        self.max_tokens = max_tokens
        self.certainty_threshold = certainty_threshold
        self.chunk_size = chunk_size
        self.uncertain_words = ["wait", "hold", "but", "okay", "no", "hmm"]
        self.probe_suffix = "... Oh, I suddenly got the answer to the whole problem, **Final Answer**\n\n\\[ \\boxed{"
        self.answer_suffix = "\n\n... Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
        self.answer_suffix_with_marker = "\n\n...\n Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.generation_dir,
            legacy=False,
            padding_side='left',
            truncation_side='left',
            trust_remote_code=False,
            use_fast=True,
        )
        self.streaming = streaming
```

The `process()` function, as mentioned before, is the core method within the `Controller` class. Here, we can customize our data flow by specifying the logic for yielding tasks. For Dynasor-CoT, we have two different kinds of tasks:

1. `proposer_task`: Handles the main content generation, producing `self.chunk_size` tokens based on the previous content.
2. `probe_task`: Elicits an early-stage answer by generating 20 tokens from the same content.

The code below creates these two types of tasks.

```python
    def process(self, tasks: List[GenerationTask], **kwargs):
        # Start with the initial prompt provided by the first task.
        initial_prompt = tasks[0].input_str

        proposer_task = GenerationTask()
        proposer_task.max_tokens = self.chunk_size
        proposer_task.temperature = 0.6
        proposer_task.top_p = 0.95
        proposer_task.worker_tag = self.WorkerTag.GENERATION

        probe_task = GenerationTask()
        probe_task.max_tokens = 20
        probe_task.temperature = 0.6
        probe_task.top_p = 0.95
        probe_task.worker_tag = self.WorkerTag.GENERATION

        probe_answers = []
        probe_responses = []

        initial_prompt_token_num = len(
            self.tokenizer.encode(initial_prompt, add_special_tokens=False))
        probe_suffix_token_num = len(
            self.tokenizer.encode(self.probe_suffix, add_special_tokens=False))

        current_prompt = initial_prompt
```

To prevent extra latency, the `proposer_task` should not be blocked by the `probe_task`. Scaffolding's task-level concurrency handles this perfectly: we can yield `proposer_task` and `probe_task` in a single list, and multiple tasks yielded together in the same list will be batched and executed in parallel.

```python
yield [proposer_task, probe_task]
```

In the following `for` loop, each iteration performs these steps:

1. **Submit** both a proposer task and a probe task by yielding them. We don't need to worry about execution details, as they are handled by `ScaffoldingLlm`, which binds the `Controller` and `Workers` together behind the scenes.
2. **Evaluate** the probe response after the tasks return, checking for consistency over several rounds (using `certainty_threshold`).
3. **Finalize** the answer and return if it is consistent. Otherwise, append the new tokens from the proposer task and proceed to the next iteration.

```python
        # Iterate over generation rounds until the maximum tokens limit is reached.
        for _ in range(initial_prompt_token_num + probe_suffix_token_num,
                       self.max_tokens, self.chunk_size):
            proposer_task.input_str = current_prompt
            # For the probe task, append the suffix to force a chain-of-thought leading to an answer.
            probe_task.input_str = current_prompt + self.probe_suffix

            yield [proposer_task, probe_task]

            # Retrieve the output from the probe task.
            probe_text = probe_task.output_str

            # Extract the potential answer from the probe response.
            answer = self.obtain_answer(probe_text)
            probe_answers.append(answer)
            probe_responses.append(probe_text)

            if self.should_early_stop(probe_answers, probe_responses):
                tasks[0].result = probe_task.result
                # If the current prompt indicates the chain-of-thought phase has ended
                # (the end-of-thinking marker is present), use one type of suffix.
                if "</think>" in current_prompt:
                    tasks[0].output_str = (current_prompt + self.answer_suffix +
                                           probe_answers[-1] + "}\n\\]")
                    return
                else:
                    # Otherwise, use the suffix with marker to transition clearly.
                    tasks[0].output_str = (current_prompt +
                                           self.answer_suffix_with_marker +
                                           probe_answers[-1] + "}\n\\]")
                    return

            # If the answer is not deemed confident, perform another round of generation.
            # Append the newly generated text from the proposer to the current prompt
            # for the next iteration.
            current_prompt += proposer_task.output_str

        # If the maximum token limit is reached without satisfying the certainty condition,
        # output the accumulated prompt as the final output.
        tasks[0].result = proposer_task.result
        tasks[0].output_str = current_prompt
        return
```

The `probe_task` can utilize prefix KV cache reuse to enhance inference performance. TensorRT LLM allows the KV cache of an in-progress request to be reused by other requests, so `probe_task` can reuse `proposer_task`'s KV cache even though `proposer_task` is still running. Now we have implemented a `Controller` for Dynasor-CoT.
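The helpers `obtain_answer()` and `should_early_stop()` used above are not shown here. A possible sketch of their logic, as methods of `DynasorGenerationController`, would be answer extraction plus a consistency check over the last `certainty_threshold` probes, treating hesitation words as uncertainty; this is an illustrative approximation, not necessarily the exact implementation.

```python
    @staticmethod
    def obtain_answer(probe_text: str) -> str:
        # The probe prompt ends with "\\boxed{", so the answer is everything before the
        # closing brace; an empty string means the probe produced no usable answer.
        return probe_text.split("}")[0].strip() if "}" in probe_text else ""

    def should_early_stop(self, probe_answers, probe_responses) -> bool:
        # Not enough probes yet to judge consistency.
        if len(probe_answers) < self.certainty_threshold:
            return False
        recent_answers = probe_answers[-self.certainty_threshold:]
        recent_responses = probe_responses[-self.certainty_threshold:]
        # Require a non-empty answer that is identical across the recent probes ...
        if recent_answers[0] == "" or len(set(recent_answers)) != 1:
            return False
        # ... and no hesitation words (e.g. "wait", "hmm") in the recent probe responses.
        return not any(word in response.lower()
                       for response in recent_responses
                       for word in self.uncertain_words)
```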
Here is an example of how to use it:

```python
dynasor_generation_controller = DynasorGenerationController(
    # Parameters for DynasorGenerationController
)
llm = ScaffoldingLlm(
    prototype_controller=dynasor_generation_controller,
    # other parameters for ScaffoldingLlm
)
results = llm.generate(prompts)
```

### Implement Dynasor-CoT based Majority Voting in Scaffolding

Scaffolding is designed to be modular and reusable. We can assemble methods just like LEGO building blocks. For instance, to implement Dynasor-CoT-based Majority Voting, we can simply stack our `DynasorGenerationController` with a `MajorityVoteController`. Once a controller for majority voting is built, no further implementation is needed. We can directly stack the two controllers as shown below.

```python
dynasor_generation_controller = DynasorGenerationController(
    # Parameters for DynasorGenerationController
)
majority_vote_controller = MajorityVoteController(
    generation_controller=dynasor_generation_controller,  # stack here
    # Other parameters for MajorityVoteController
)
llm = ScaffoldingLlm(
    prototype_controller=majority_vote_controller,  # Expose the outermost controller to ScaffoldingLlm
    # other parameters for ScaffoldingLlm
)
results = llm.generate(prompts)
```

### Acknowledgements

This work demonstrates an outstanding example of cross-team collaboration between the TensorRT LLM team and the UCSD Hao AI Lab. We sincerely appreciate the support from everyone who contributed to making this happen.

### Reference

[1] Y. Fu*, J. Chen*, Y. Zhuang, Z. Fu, I. Stoica, and H. Zhang, "Dynasor: More Efficient Chain-of-Thought Through Certainty Probing," Hao-AI-Lab Blog, Feb. 16, 2025. [Online]. Available: https://hao-ai-lab.github.io/blogs/dynasor-cot/

## Feature List on Scaffolding

You can customize your own `Controller`, `Worker`, and `Task`; however, we have provided a foundational set with commonly used functionality that you can use directly:

* `Worker`: TensorRT LLM, OpenaiAPI, MCP
* `Task`: Generation, Reward, ToolCall
* `Controller`: MajorityVote, PRMReward, BestOfN, MCTS

## Future Work

The future work is divided into two parts. The first part is to enable `Scaffolding` to support more inference-time compute methods, especially methods for agentic and multi-agent systems. The second part is that we hope to find more opportunities to optimize TensorRT LLM based on `Scaffolding` workloads. For example, in terms of KV cache prefix reuse, `Scaffolding` can identify which parts are system prompts, which parts are likely to be reused in subsequent requests of the agent task, and which parts cannot be reused and can be evicted immediately.

Finally, we want to emphasize that we welcome and look forward to more people joining our open source community. You can find related issues in the [TensorRT LLM GitHub issues with the Scaffolding tag](https://github.com/NVIDIA/TensorRT-LLM/issues?q=state%3Aopen%20label%3AScaffolding).
---

# Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)

This blog post is a continuation of previous posts:

* [Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
* [Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)

In this blog post, we focus on performance optimization, diving deeper into techniques such as lower precision, network structure refactoring, and aggressive kernel fusion. We hope this analysis and optimization process brings new inspiration to your model inference optimization work.

*By NVIDIA TensorRT LLM Team*

## Table of Contents

- [Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)](#scaling-expert-parallelism-in-tensorrt-llm-part-3-pushing-the-performance-boundary)
  - [Table of Contents](#table-of-contents)
  - [Overview](#overview)
  - [Lower precision](#lower-precision)
    - [wo GEMM FP4 quantization](#wo-gemm-fp4-quantization)
    - [Low precision `AlltoAll`](#low-precision-alltoall)
    - [FP8 context FMHA support](#fp8-context-fmha-support)
  - [Rethink network structure](#rethink-network-structure)
    - [MTP LM head tensor parallelism](#mtp-lm-head-tensor-parallelism)
    - [Context phase Q/K/V `concat` optimization](#context-phase-qkv-concat-optimization)
  - [More kernel overlap, fusion and optimization](#more-kernel-overlap-fusion-and-optimization)
    - [Overlap kernels using programmatic dependent launch (PDL)](#overlap-kernels-using-programmatic-dependent-launch-pdl)
    - [Fuse several `AlltoAll` kernels](#fuse-several-alltoall-kernels)
    - [Fuse `add` (sparse exp and shared exp) into local reduction](#fuse-add-sparse-exp-and-shared-exp-into-local-reduction)
    - [Optimize PyTorch native `copy` and `concat` using `torch.compile`](#optimize-pytorch-native-copy-and-concat-using-torchcompile)
  - [End-to-End Performance](#end-to-end-performance)
  - [Acknowledgements](#acknowledgements)

## Overview

Let's first take a look at the network structure before the optimizations, to give an overall view of how the workloads look:

Figure 1: Network structure overview before optimization

In this third blog of our scaling Expert Parallelism (EP) series, we push the performance boundaries of large-scale EP on NVIDIA GB200 NVL72 through multiple optimization techniques. Building upon the foundation established in [part 1](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) and [part 2](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md), this blog explores three key optimization pillars: **lower precision computation** (including FP4 quantization for wo GEMM, low-precision AlltoAll communication, and FP8 context FMHA), **network structure rethinking** (featuring MTP LM head tensor parallelism and context phase Q/K/V concatenation elimination), and **aggressive kernel fusion and overlap** (leveraging Programmatic Dependent Launch, fused AlltoAll operations, and torch.compile optimizations). These optimizations collectively deliver significant end-to-end performance improvements for wide-EP scenarios on NVIDIA GB200 NVL72, for DeepSeek R1 with its specialized Multi-head Latent Attention (MLA) mechanism. Each technique is carefully designed to maintain accuracy while maximizing performance, demonstrating the power of combining algorithmic innovation with deep hardware awareness.

## Lower precision

### wo GEMM FP4 quantization

The wo GEMM is the final linear layer within the multi-head attention block that produces the final outputs. While DeepSeek R1's MLA modifies the initial projections for keys and values, the wo GEMM operator remains a critical and standard component for finalizing the attention computation. In this term, "wo" is an abbreviation for the weight matrix of the output projection.

We've evaluated that quantizing the wo GEMM to FP4 still satisfies the accuracy requirements, maintaining a similar MTP accept rate (AR) while improving end-to-end performance. The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) team has published checkpoints that additionally quantize the wo module in attention layers to FP4 on HuggingFace:

* https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2
* https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4-v2

In TensorRT LLM, this is supported by [PR 6393](https://github.com/NVIDIA/TensorRT-LLM/pull/6393). To utilize the checkpoints, simply use the LLM API or `trtllm-serve` to load them. Refer to [deploy-with-tensorrt-llm](https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2#deploy-with-tensorrt-llm) for more details.

### Low precision `AlltoAll`

In wide-EP MoE, the combine phase (after experts finish FC2) performs an all-to-all to return each token's expert outputs to its origin rank, followed by a per-token reduce over the top-k experts. This step is typically bandwidth-bound when the FC2 outputs are in BF16 or FP16. We introduce a low-precision AlltoAll that transmits these combine payloads in NVFP4 instead of BF16/FP16, then dequantizes them back on the receiver before the local reduction. During combine, we temporarily quantize the per-token expert outputs to NVFP4 (e2m1 values with per-16-element E4M3 scale factors plus a global scale) inside shared memory, send the compact representation across GPUs, and dequantize back to the original dtype on the receiving side. Indices and routing-related small tensors remain in their native types.
Since we quantize only for transport and the outputs are dequantized back to the working dtype before the per-token reduction, we observe negligible accuracy impact; tolerances comparable to a quant-dequant roundtrip are sufficient. This feature is supported by [PR 7155](https://github.com/NVIDIA/TensorRT-LLM/pull/7155) and [PR 7898](https://github.com/NVIDIA/TensorRT-LLM/pull/7898).

### FP8 context FMHA support

FP8 context FMHA is a technique that uses the FP8 data format to accelerate the FMHA/MLA computation during the context phase of a model. It is designed to improve TTFT and prefill throughput, particularly when processing long contexts, without significantly sacrificing accuracy.

In the context phase, K and V can be stored in FP8 format, which is often referred to as FP8 KV cache. Using FP8 KV cache can significantly save GPU memory, which is especially beneficial for long input sequences. However, since Q is in BF16 format, FMHA would also be performed in BF16, which cannot benefit from the FP8 Tensor Core. With FP8 context FMHA, we first quantize Q into FP8 format, aligning it with the FP8 K and V, and then leverage the FP8 Tensor Core for FMHA/MLA. Since the context phase is compute-bound and the Tensor Core has much higher FP8 FLOPS than BF16 FLOPS, the speed-up becomes more pronounced as the input sequence length grows. Since FP8 context FMHA maintains accuracy very close to the BF16 baseline, we enable it automatically when users use FP8 KV cache on Hopper or Blackwell. This is supported by [PR 7610](https://github.com/NVIDIA/TensorRT-LLM/pull/7610) and [PR 7612](https://github.com/NVIDIA/TensorRT-LLM/pull/7612).

## Rethink network structure

### MTP LM head tensor parallelism

The LM (language modeling) head is responsible for converting the `hidden_states` computed by the previous decoder layers to `logits`. It's a linear layer with weights of shape `(vocab_size, hidden_size)`, outputting logits of shape `(batch_size, seqlen, vocab_size)`. We are primarily interested in the logits corresponding to the last token of the input sequence, so the logits finally have shape `(batch_size, vocab_size)`.

When MTP is enabled, the number of tokens that the MTP layers handle equals the batch size, while the main model handles `(1 + MTP) * batch_size` tokens. This makes the LM head computation in the MTP layers more likely to fall into the memory-bound range; empirically, 256 tokens is the boundary between memory-bound and math-bound. This leads to an optimization idea: if we keep the calculation memory-bound but reduce the size of the weights that need to be loaded, there could be performance benefits.

Based on this analysis, we conducted experiments on the following scenario: a DeepSeek R1 EP32 case with attention DP and MTP-3 enabled, where the local per-rank batch size is 32. Before the optimization, there is 32-way data parallelism, so each MTP module on each rank processes 32 tokens for the LM head calculation.

Figure 2: MTP LM head computation before optimization

In the optimization, we first perform an `AllGather` across every 4 GPUs, so that each GB200 node has all tokens prepared for the following TP4 calculation. Then, we split the LM head weights along the vocab dimension across those 4 GPUs and perform 4-way TP. Afterwards, each TP rank computes the argmax over its local vocabulary shard, a second `AllGather` collects these local results, and the global argmax is taken across all TP ranks. Computing the local argmax first minimizes both communication volume and argmax computation overhead. Finally, we split the results back to the original data-parallel layout to guarantee correctness.
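Figure 3 below shows the optimized flow. The following is a rough PyTorch-style sketch of the local-argmax pattern; the process group, tensor names, and shapes are illustrative and not the actual implementation.

```python
import torch
import torch.distributed as dist

def mtp_lm_head_tp(hidden, lm_head_shard, tp_group, vocab_shard_size):
    """Illustrative 4-way TP LM head with local argmax.

    hidden:        [local_batch, hidden_size] tokens owned by this rank
    lm_head_shard: [vocab_shard_size, hidden_size] vocab shard on this rank
    """
    # 1) Gather tokens from the ranks in the TP group.
    gathered = [torch.empty_like(hidden) for _ in range(tp_group.size())]
    dist.all_gather(gathered, hidden, group=tp_group)
    all_tokens = torch.cat(gathered, dim=0)             # [tp * local_batch, hidden]

    # 2) Each rank computes logits only for its vocab shard.
    local_logits = all_tokens @ lm_head_shard.T         # [tp * local_batch, vocab_shard]

    # 3) Local argmax first, then exchange only the small (value, index) pairs.
    local_max, local_idx = local_logits.max(dim=-1)
    local_idx += dist.get_rank(group=tp_group) * vocab_shard_size
    maxes = [torch.empty_like(local_max) for _ in range(tp_group.size())]
    idxs = [torch.empty_like(local_idx) for _ in range(tp_group.size())]
    dist.all_gather(maxes, local_max, group=tp_group)
    dist.all_gather(idxs, local_idx, group=tp_group)

    # 4) Global argmax across vocab shards, then keep this rank's own tokens.
    winner = torch.stack(maxes).argmax(dim=0)            # winning shard per token
    token_ids = torch.stack(idxs).gather(0, winner.unsqueeze(0)).squeeze(0)
    rank = dist.get_rank(group=tp_group)
    return token_ids[rank * hidden.shape[0]:(rank + 1) * hidden.shape[0]]
```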

Figure 3: MTP LM head computation after applying tensor parallelism

*Some layers are omitted in the diagrams above to keep the example simple.*

Note that we can expand the TP to 8-way to utilize multi-node NVLink, as long as we still gain performance from reducing weight-loading time in memory-bound scenarios. This feature is supported by [PR 7571](https://github.com/NVIDIA/TensorRT-LLM/pull/7571) and [PR 7891](https://github.com/NVIDIA/TensorRT-LLM/pull/7891).

### Context phase Q/K/V `concat` optimization

In the standard attention mechanism, Q/K/V are derived from the same hidden states through `GEMM_Q`/`GEMM_K`/`GEMM_V` operations, and TensorRT LLM typically merges the weights of these three GEMMs in advance, executing a single `GEMM_QKV` to obtain a large contiguous QKV tensor, which is then used as the input to the attention kernels.

However, DeepSeek's MLA is a special attention module where Q/K/V are obtained by applying different downsampling-upsampling processes to the hidden states. Additionally, Q and K are divided into two parts, with RoPE and without RoPE, so a contiguous QKV tensor cannot be obtained directly. In the initial implementation of context MLA, due to input format constraints of the attention kernels, TensorRT LLM had to explicitly concatenate the Q/K/V tensors into one contiguous QKV tensor, resulting in extra memory and time overhead that became more significant in wide-EP scenarios.

Recently, we introduced a new input format for the context MLA kernels called "separate qkv". As the name implies, these attention kernels now support three separate Q/K/V tensors as direct inputs. [PR 6538](https://github.com/NVIDIA/TensorRT-LLM/pull/6538) refactors the MLA process to eliminate the need for concatenating Q/K/V, saving copy operations and significantly improving prefill latency in wide-EP scenarios.

## More kernel overlap, fusion and optimization

The team has implemented aggressive kernel fusion, overlap, and optimization to reduce kernel launch overheads and overall kernel duration. This includes:

- overlapping kernels using PDL,
- fusing several `AlltoAll` kernels through refactoring,
- fusing the sparse-expert and shared-expert `add` into the local reduction,
- fusing `memset` into `expandinputrow`,
- fusing `finalizeMoeRouting` into FC2, and
- removing the `swizzle` kernel after `AlltoAll`.

The following three representative examples demonstrate the common ideas behind these optimizations.

### Overlap kernels using programmatic dependent launch (PDL)

The Programmatic Dependent Launch (PDL) mechanism allows a dependent secondary kernel to launch before the primary kernel it depends on in the same CUDA stream has finished executing. Refer to the [official documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization) for more details. TensorRT LLM has been using this feature to optimize end-to-end performance, and we have now brought it to the kernels used by the wide-EP workflow as well. The implementation is in [PR 7977](https://github.com/NVIDIA/TensorRT-LLM/pull/7977). We insert a call to the `cudaTriggerProgrammaticLaunchCompletion` API in all thread blocks of the primary kernel, which signals that the secondary kernel is ready to launch, and call the `cudaGridDependencySynchronize` API in the secondary kernel, which blocks until all primary kernels the secondary kernel depends on have completed and flushed their results to global memory.
The following example from the [official documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#api-description) demonstrates how PDL is supported in TensorRT LLM. The only difference is that we insert both `cudaTriggerProgrammaticLaunchCompletion` and `cudaGridDependencySynchronize` into the same kernel so that it can overlap with both the preceding and the subsequent kernels.

```c
__global__ void primary_kernel()
{
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Independent work

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}
```

We have verified accuracy after the modification to ensure that computation results are not affected by incorrect memory reads and writes. With that guarantee in place, we overlap these kernels as much as possible for performance. In TensorRT LLM, PDL can be enabled by setting the environment variable `TRTLLM_ENABLE_PDL` to `1`; we may expose this as an official API in the future.

The effect of enabling PDL can be clearly observed using [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems). Taking the `moeComputeRouteKernel`, `computeCountAndIndiceDevice`, and `computeCumsumDevice` kernels as an example, they execute strictly in order when PDL is disabled:

Figure 4: The profiling results of disabling PDL.

The following profiling results show how the three kernels overlap after enabling PDL.

Figure 5: The profiling results of enabling PDL.

*The above profiles were generated using commit [84d2f12](https://github.com/NVIDIA/TensorRT-LLM/tree/84d2f1281857fbb1662b14603d3123cf327ac94f) on the main branch. They may change in future versions.*

For tips on using Nsight Systems to profile and analyze TensorRT LLM performance, refer to [Coordinating with NVIDIA Nsight Systems Launch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-analysis.md#coordinating-with-nvidia-nsight-systems-launch).

### Fuse several `AlltoAll` kernels

We redesigned and reimplemented `AlltoAll` to better support communication fusion (covering the `hiddenStates` during dispatch, the low-precision scaling factors, and MoE's `tokenSelectedExpert` and scales), to support low-precision communication during dispatch, and to handle potential alignment issues in the original data. Taking the dispatch of four fields as an example, the data flow is shown in Figure 6.

Figure 6: The data flow of the new AlltoAll kernel

The sending process is as follows:

- The first step loads the original data from global memory according to its alignment, using TMA to load it into shared memory as `unAlignedData`.
- Next, in shared memory, all fields are aligned to 16-byte boundaries and the different fields are concatenated together to form `alignedData`.
- If low-precision communication is needed, the aligned data is quantized into low-precision `lowPrecisionData`. Currently, quantization is only supported for a single field.
- Next, the data is encoded according to the protocol. For example, with LL128, each 128 bytes contains 120 bytes of valid data and 8 bytes of flags. To avoid bank conflicts during encoding in shared memory, we select different flag positions for different packets, and the final encoded data is stored in `protoPackedData+Flag`.
- Finally, the proto-encoded `protoPackedData+Flag` is written to the remote GPU's workspace.

For the receiver, it only needs to check the flag at the corresponding position in the workspace to confirm whether the data is ready. If ready, the original data is decoded in the reverse order of sending and written to the corresponding tensors.

Through this approach, we can send and receive multiple arbitrarily aligned fields in a fused manner and support low-precision communication during the combine process. This feature was implemented in [PR 6973](https://github.com/NVIDIA/TensorRT-LLM/pull/6973).

### Fuse `add` (sparse exp and shared exp) into local reduction

To reduce the number of kernel launches and achieve better overlap at the tail of the MoE module, we've fused the shared-expert `add` into the local reduction kernel that aggregates the top-k experts. This removes the extra add operator without increasing the reduce operator's overhead, and it writes the output only once, lowering bandwidth usage. The optimization is compatible with NVFP4 combine, requires no API changes, and brings no accuracy impact. It was added by [PR 7422](https://github.com/NVIDIA/TensorRT-LLM/pull/7422).

### Optimize PyTorch native `copy` and `concat` using `torch.compile`

We have observed several inefficient `copy` and `concat` operations in the context phase in wide-EP scenarios, and one significant case is copying `k_nope` in the MLA module. As mentioned in the previous section, Q and K are divided into two parts in DeepSeek MLA: with RoPE and without RoPE. In the context phase, the nope part has a head size of 128 and the rope part has a head size of 64, adding up to a head size of 192. However, the FMHA kernel reads Q and K with head size 192 directly, which means we have to prepare the full Q and K tensors using `copy` and `concat`.

In an ISL/OSL 8k/1k, batch size 1 case, we observed that this copy takes 306 us in the context phase, which is clearly suboptimal. A rough theoretical estimate, assuming 8 TB/sec of HBM3e bandwidth, is:

```
( ISL 8192 * k_nope_size 128 * num_heads 128 * 2 bytes * read/write 2 ) / ( 8 TB/sec * efficiency 0.8 ) = 80 us
```

To optimize the operator, we simply applied the `torch.compile` decorator to it, and the kernel duration drops to 107 us, which is much closer to the theoretical estimate. [PR 8044](https://github.com/NVIDIA/TensorRT-LLM/pull/8044) implemented the changes. This is an outstanding example of the power of `torch.compile`, showing how an operator can be analyzed and optimized without hand-crafting kernels.
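As a rough illustration of the approach (the function and shapes below are hypothetical rather than the actual TensorRT LLM code), wrapping the concatenation in a `torch.compile`-decorated function lets the compiler generate a single fused copy kernel:

```python
import torch

@torch.compile  # let the compiler fuse the copy/concat into one generated kernel
def build_full_k(k_nope: torch.Tensor, k_rope: torch.Tensor) -> torch.Tensor:
    # k_nope: [num_tokens, num_heads, 128], k_rope: [num_tokens, num_heads, 64]
    # Returns K with head size 192, as expected by the context FMHA kernel.
    return torch.cat([k_nope, k_rope], dim=-1)

# Illustrative shapes for the ISL 8k, batch size 1 case discussed above.
k_nope = torch.randn(8192, 128, 128, dtype=torch.bfloat16, device="cuda")
k_rope = torch.randn(8192, 128, 64, dtype=torch.bfloat16, device="cuda")
k = build_full_k(k_nope, k_rope)   # [8192, 128, 192]
```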
## End-to-End Performance

After applying the optimizations above, the network structure is cleaner. For example, `o_proj` and the `A2A tokens` now compute in lower precision, and operators like the sparse-expert and shared-expert `add` are now fused into the `reduction`. The optimized parts are marked in **bold**.

Figure 7: Network structure overview after optimization

We measured performance with these optimizations and compared it against the baseline (the main branch as of July). The optimizations described above deliver a significant end-to-end improvement.

Figure 8: End-to-End Performance on Aug 31st

*Note: The numbers were collected on August 31st. Some optimizations mentioned above were not yet added at that time.*

To review how wide EP helps with Blackwell's leading inference benchmarks, also read these recent blog posts:

* [NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX™ v1 Benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
* [NVIDIA Blackwell Raises Bar in New InferenceMAX Benchmarks, Delivering Unmatched Performance and Efficiency](https://blogs.nvidia.com/blog/blackwell-inferencemax-benchmark-results/)

## Acknowledgements

This is a great continuation of previous work on TensorRT-LLM wide EP and another demonstration of excellent teamwork. It stems from brilliant performance optimization ideas, solid performance analysis and benchmarking, and rapid engineering support and implementation. By sharing these experiences, we hope to help more people who are interested in deploying large-scale LLM models on NVIDIA GPUs to run AI faster.

---

# Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs

by NVIDIA TensorRT LLM team

## Table of Contents

- [Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs](#pushing-latency-boundaries-optimizing-deepseek-r1-performance-on-nvidia-b200-gpus)
  - [Table of Contents](#table-of-contents)
  - [Background](#background)
  - [Implementation Configuration](#implementation-configuration)
    - [Workload Profile](#workload-profile)
    - [Model Architecture](#model-architecture)
    - [Precision Strategy](#precision-strategy)
    - [Parallelism Strategy](#parallelism-strategy)
    - [Everything in One Diagram](#everything-in-one-diagram)
  - [Key Optimizations](#key-optimizations)
    - [System Level optimizations](#system-level-optimizations)
      - [CUDA Graph \& Programmatic Dependent Launch](#cuda-graph--programmatic-dependent-launch)
      - [MTP](#mtp)
        - [Autoregressive MTP Layers](#autoregressive-mtp-layers)
        - [Relax Acceptance Verification](#relax-acceptance-verification)
      - [Multi-streams](#multi-streams)
      - [Sparse Experts as GEMMs (only works when moe\_backend=CUTLASS)](#sparse-experts-as-gemms-only-works-when-moe_backendcutlass)
      - [Re-balanced the sparse experts](#re-balanced-the-sparse-experts)
        - [Mixed ETP](#mixed-etp)
        - [Smart Router](#smart-router)
    - [Kernel Level optimizations](#kernel-level-optimizations)
      - [Attention Kernel](#attention-kernel)
      - [Grouped GEMM](#grouped-gemm)
        - [CUTLASS Backend (default backend)](#cutlass-backend-default-backend)
        - [TRTLLM Backend](#trtllm-backend)
      - [Communication Kernel](#communication-kernel)
      - [Dense GEMM optimization](#dense-gemm-optimization)
        - [Fuse\_A\_GEMM](#fuse_a_gemm)
        - [RouterGEMM](#routergemm)
      - [Kernel fusion](#kernel-fusion)
  - [How to reproduce](#how-to-reproduce)
  - [Future Works](#future-works)
  - [Acknowledgment](#acknowledgment)

## Background

Recent advancements in Large Language Reasoning Models have demonstrated remarkable success, while creating new deployment challenges. A critical challenge emerges from extended Output Sequence Lengths (OSL) due to complex "thinking and reasoning" processes. Longer OSL demands stricter Token-to-Token Latency (TTL) requirements, often forcing concurrency limitations. The most extreme case, single concurrency (the min-latency scenario), becomes particularly challenging for real-time applications.
This article explores how TensorRT LLM achieves record-breaking performance for [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) in min-latency scenarios on NVIDIA's 8×B200 GPU configuration, progressing from 67 tokens per second (TPS) to 253 TPS before GTC 2025 (a **3.7x** speed-up), and now to 368 TPS (a **5.5x** speed-up).

## Implementation Configuration

### Workload Profile

- Input Sequence Length (ISL): 1k tokens
- Output Sequence Length (OSL): 2k tokens

### Model Architecture

The base DeepSeek-R1 main model contains 3x dense layers (initial) and 58x MoE layers; there is also 1x Multi-Token Prediction (MTP) layer (architecturally equivalent to an MoE layer) for speculative decoding. Our optimized configuration extends the MTP layer to 3x layers used autoregressively for peak performance exploration.

tech_blog1_model_overview

### Precision Strategy

We have explored a mixed precision recipe, which provides a better tradeoff between accuracy and performance.

| Component | Precision |
|:-------------------------------------:|:---------:|
| 64x Attention Modules | bf16\* |
| 3x Dense FFN Layers | nvfp4\*\* |
| 58x MoE FFN Layers | nvfp4 |
| 3x MTP Layers | bf16 |
| RouterGEMM\*\*\* | bf16 |

\* TensorRT LLM already supports [FP8 Attention](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#fp8-kv-cache-and-mla); for this latency scenario, low-precision attention computation doesn't help performance, so we use bf16 for the attention modules.

\*\* The nvfp4 model checkpoint is generated by the [NVIDIA Model Optimizer toolkit](https://github.com/NVIDIA/Model-Optimizer).

\*\*\* RouterGEMM uses bf16 inputs/weights with fp32 outputs for numerical stability.

### Parallelism Strategy

We have also explored and introduced a mixed parallelism strategy on 8x B200 GPUs. Specifically, the best strategy for this latency scenario is 'TP8EP2', which is defined as follows:

| Component | Parallelism Patterns |
|:---------------------:|:--------------------------------------------------------:|
| Attention Modules | Tensor Parallelism 8 (TP8) |
| MoE Sparse Experts | Mixed TP4 with Expert Parallelism 2 (EP2) |
| MoE Shared Experts | TP8 |
| Fuse_A GEMM | Data Parallelism 8 (DP8) |
| RouterGEMM | DP8 |

### Everything in One Diagram

Now let's put everything into one diagram, which represents an MoE layer from a decoding iteration.

tech_blog1_model_details

The modules in the diagram are:

- **Input Module**: A BF16 tensor with shape [m, 7168], where m is the number of tokens (for instance, m = 4 when using three MTP layers) and 7168 is the model's hidden size.
- **Module1: Fuse_A_GEMM**: Concatenates the weights for [WDQ, WDKV, and WKR](https://arxiv.org/pdf/2412.19437) to reduce kernel launch overhead.
- **Module2: 2× RMSNorm**: Performs normalization for the Q/K tensors. These can be either overlapped on multiple streams or fused into a single grouped RMSNorm.
- **Module3: UQ_QR_GEMM**: Concatenates the WUQ and WQR weights to reduce kernel launch overhead.
- **Module4: UK_BGEMM**: Uses WUK in a batched GEMM. We avoid absorbing Modules 3 and 4 to prevent weight-size inflation and extra loading costs.
- **Module5: Concat KVCache & applyRope**: Merges the K/V cache and applies RoPE (Rotary Position Embedding).
- **Module6: genAttention**: Performs MLA during generation, acting like an MQA with num_q_heads = 128 / TP8 = 16.
- **Module7: UV_GEMM**: Executes a batched GEMM with the WUV weights.
- **Module8: WO_GEMM**: Runs a dense GEMM using the WO weights. We do not absorb Modules 7 and 8 to avoid increased weight-loading overhead.
- **Module9: Fused Kernels**: Incorporates oneshotAllReduce, Add_RMSNorm, and DynamicQuant (BF16 -> NVFP4) in a single kernel.
- **Module10: routerGEMM & topK**: Handles the router GEMM and top-K selection.
- **Module11: Shared Expert**: Overlaps partially with Module10 and Module12.
- **Module12: Sparse Experts**: Implements the expert layers via grouped GEMM.
- **Module13: Final Fused Kernels**: Performs the localReduction, oneshotAllReduce, and Add_RMSNorm operations together.

## Key Optimizations

| Feature | TPS/User | Code Links / Notes |
|:----------------------------------------------------------|:--------:|:--------------------------------------------------------------------------------|
| Baseline: CUDA Graph + EP8TP8 | 67 | [modeling_deepseekv3.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_deepseekv3.py) |
| Multi Stream to overlap shared expert with sparse experts | 73 | [modeling_deepseekv3.py#L506](https://github.com/NVIDIA/TensorRT-LLM/blob/14bfb5e0d6e81aec3306a1324cf074566646f886/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L506) |
| Optimize MLA Kernel | 80 | [PR #3763](https://github.com/NVIDIA/TensorRT-LLM/pull/3763) |
| Optimize TopK Kernels | 84 | • [RoutingKernelTopK.cuh](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh)<br>• [noAuxTcKernels.cu](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/noAuxTcKernels.cu) |
| Optimize Fuse_A_GEMM | 89 | [attention.py#L345](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/modules/attention.py#L345) |
| MTP3_Vanilla | 154 | Evolved into MTP3_Autoregressive |
| Evolve to MTP3_Autoregressive + Optimize Router GEMM | 164 | [modeling_deepseekv3.py#L304](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L304) |
| Fuse oneshotAR + RMSNorm | 168 | [allReduceFusionKernels.cu#L440](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu#L440) |
| Enable PDL | 173 | Set environment variable: `export TRTLLM_ENABLE_PDL=1` |
| Multi-stream to overlap two RMS_norms | 180 | [attention.py#L546](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/modules/attention.py#L546) |
| MTP3_Autoregressive | 204 | [modeling_deepseekv3.py#L823](https://github.com/NVIDIA/TensorRT-LLM/blob/d6b741ddfe7f8a80718c10d49773c42abc0a254f/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L823) |
| Finetune clock/power | 211 | `sudo nvidia-smi -pm 0; sudo nvidia-smi -pm 1; sudo nvidia-smi boost-slider --vboost 4` |
| Optimize CUTLASS Grouped GEMM Kernels | 236 | The code is not yet open-source because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Optimize CUTLASS Flow: Sparse Experts as GEMMs | 249 | The code is not yet open-source because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Introduce EP4TP2 for better workload balance | 253 | Use `--tp 8 --ep 4` when benchmarking |
| Introduce moe_backend=TRTLLM, EP2TP4 for better balance | 299 | [PR #4280](https://github.com/NVIDIA/TensorRT-LLM/pull/4280) |
| Optimize Fuse_A_GEMM and Router_GEMM | 340 | WIP |
| Relax Acceptance | **368** | [deepseek_v3#multi-token-prediction-mtp](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#multi-token-prediction-mtp) |

### System Level optimizations

#### CUDA Graph & Programmatic Dependent Launch

[CUDA Graph](https://developer.nvidia.com/blog/cuda-graphs/) is necessary to overcome the CPU overhead for small workloads, while [Programmatic Dependent Launch](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Programmatic%2520Dependent%2520Launch#programmatic-dependent-launch-and-synchronization) can be used to further reduce kernel launch latency.

#### MTP

There are two optimizations based on MTP.

##### Autoregressive MTP Layers

| Version | Acceptance Rate | TPS/User | TPS/User Speedup |
|:-----------:|:---------------:|:--------:|:----------------:|
| Without MTP | 1.00 | 111 | 1.00 |
| MTP 1 | 1.92 | 198 | 1.78 |
| MTP 2 | 2.58 | 250 | 2.25 |
| MTP 3 | 2.82 | 253 | 2.28 |
| MTP 4 | 2.99 | 245 | 2.21 |
| MTP 5 | 3.01 | 239 | 2.15 |

Based on our exploration, the 3x MTP layer configuration demonstrates optimal performance.

##### Relax Acceptance Verification

For reasoning models (such as DeepSeek-R1), generation may consist of two phases: a thinking phase and the actual output.

During the thinking phase, when relaxed acceptance is enabled, a draft token can be accepted when it is in a candidate set. This candidate set is generated based on the top-N logits and a probability threshold:

- topN: The top-N tokens are sampled from the logits.
- Probability threshold: Among the top-N candidates, only tokens with a probability greater than the top-1 probability minus delta remain in the candidate set.

During the non-thinking phase, we still use strict acceptance.
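A minimal sketch of this candidate-set construction (illustrative only, not the actual implementation; `top_n` and `delta` correspond to the topN and delta parameters above):

```python
import torch

def relaxed_candidate_set(logits: torch.Tensor, top_n: int = 10, delta: float = 0.6):
    # logits: [vocab_size] for one draft position.
    probs = torch.softmax(logits.float(), dim=-1)
    top_probs, top_ids = probs.topk(top_n)
    keep = top_probs >= (top_probs[0] - delta)   # within delta of the top-1 probability
    return set(top_ids[keep].tolist())

# During the thinking phase, a draft token is accepted if it falls in this set;
# during the non-thinking phase, it must match the top-1 token exactly.
```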
| Version | Acceptance Rate | TPS/User Speedup |
|:------------------:|:--------------:|:----------------:|
| MTP3_top1, d0.0 | 2.82 | 1.00 |
| MTP3_top10, d0.5 | 3.06 | 1.08 |
| MTP3_top10, d0.6 | 3.10 | 1.09 |
| MTP3_top15, d0.5 | 3.07 | 1.08 |

This relaxed verification improves the acceptance rate and brings a positive speedup with limited influence on accuracy.

| Dataset | Test Size | Accuracy w/o Relaxed Acceptance | Accuracy w/ Relaxed Acceptance |
|:-------------------------:|:---------:|:----------:|:----------:|
| MMLU-Pro | 12,032 | 84.0% | 81.2% |
| Humanity's Last Exam | 2,684 | 9.0% | 9.0% |
| GPQA Diamond | 198 | 71.0% | 69.2% |
| MATH-500 | 500 | 96.0% | 96.2% |
| AIME 2024 | 30 | 68.0% | 74.0% |
| SciCode | 338 | 36.0% | 39.0% |
| LiveCodeBench | 315 | 62.0% | 66.0% |

For more information, please visit [multi-token-prediction-mtp](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#multi-token-prediction-mtp).

#### Multi-streams

We have introduced multi-stream-based optimizations to hide the overhead of some kernels, such as:

- Overlap shared experts with sparse experts
- Overlap the Concat_KVCache kernel with GEMM

#### Sparse Experts as GEMMs (only works when moe_backend=CUTLASS)

tech_blog1_sparse_exp_as_a_gemm

The existing CUTLASS-based Sparse Experts flow (illustrated in the figure) dispatches input tokens to their designated experts, then applies an indexed local reduction on each expert's outputs before a global allreduce. Both the dispatching and the indexed local reduction incur high overhead in low-latency scenarios. To address this, we propose treating "Sparse Experts as GEMMs" by sending all tokens to each activated expert and masking out unneeded outputs before the local reduction. Because grouped GEMMs are memory-bound, the extra computation from redundant tokens has minimal impact, effectively eliminating the costly dispatch and reduction overhead.
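The idea can be sketched as follows (illustrative PyTorch only; each expert is collapsed to a single weight matrix for brevity, whereas the real flow runs inside the grouped GEMM kernels):

```python
import torch

def sparse_experts_as_gemms(x, expert_weights, topk_ids, topk_scales):
    # x: [num_tokens, hidden]; expert_weights: list of [hidden, hidden] matrices
    # topk_ids / topk_scales: [num_tokens, top_k] routing decisions.
    out = torch.zeros_like(x)
    for e, w in enumerate(expert_weights):         # activated experts only
        y = x @ w                                  # one dense GEMM over ALL tokens
        # Routing mask: scale is 0 for tokens not routed to expert e, so their
        # (redundant) outputs are masked out during the local reduction.
        scale = (topk_scales * (topk_ids == e)).sum(dim=-1, keepdim=True)
        out += y * scale
    return out
```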
#### Re-balanced the sparse experts

For sparse experts, two parallelization strategies are commonly used: Expert Parallel (EP) and Tensor Parallel (TP). Expert Parallel maps each expert to a distinct GPU, achieving high memory and computational efficiency. However, token placement is data-dependent, which distributes workloads unevenly across GPUs and adds overhead to the AllReduce step after the MoE module. Tensor Parallel shards each expert evenly across GPUs, creating a balanced workload but sacrificing math/memory efficiency.

##### Mixed ETP

A combined EP/TP approach can mitigate both challenges. In practice, our experiments show that a TP4EP2 configuration offers the best performance.

##### Smart Router

Alternatively, by storing all expert weights on a cluster of four GPUs and replicating them to another four-GPU cluster, a smart router can dynamically dispatch tokens across each cluster. This design keeps the workload distribution balanced without significantly impacting local memory and computation efficiency.

### Kernel Level optimizations

#### Attention Kernel

We have developed a customized MLA attention kernel to better utilize GPU resources in latency scenarios.

#### Grouped GEMM

##### CUTLASS Backend (default backend)

Our default MoE backend is based on CUTLASS. It is flexible and robust, but not always the fastest option.

##### TRTLLM Backend

The other MoE backend is TRTLLM, which provides better performance. We are working to make it more flexible and robust, and it will eventually become the default backend for grouped GEMM computation in latency scenarios.

#### Communication Kernel

For small message sizes, regular NCCL latency-bound AllReduce kernels are inefficient, so we've developed a customized one-shot AllReduce kernel. It leverages the powerful NVSwitch hardware capability by acting like an initial broadcast followed by a local reduction, delivering better performance in min-latency scenarios.

#### Dense GEMM optimization

We focus on optimizing two dense GEMMs, Fuse_A_GEMM and RouterGEMM, because they dominate the execution time, suffer from low memory efficiency, and cannot be easily sharded (they are DP-based).

##### Fuse_A_GEMM

We developed a custom Fuse_A_GEMM that prefetches the majority of its weights into shared memory (enabled by PDL and overlapped with the one-shot AllReduce), significantly enhancing performance. The kernel shows substantial improvements over the default GEMM implementation when num_tokens < 16.

tech_blog1_fuse_a_gemm

##### RouterGEMM

By leveraging our internal AI code generator, we automatically generate an optimized RouterGEMM kernel, which delivers substantial improvements over the default GEMM implementation when num_tokens <= 30.

tech_blog1_router_gemm

#### Kernel fusion

Kernel fusion is necessary in min-latency scenarios to reduce extra global memory write/read cost. We currently support the following fusion patterns:

- Fuse two overlapped RMS_Norms into one GroupedRMSNorm
- Fuse (LocalReduction) + AR + RMS_Norm + (Dynamic_Quant_bf16tonvfp4) into one kernel
- Fuse Grouped GEMM_FC1 + dot activation (when moe_backend=TRTLLM) into one kernel

## How to reproduce

See https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-min-latency.

Note that Relaxed Acceptance is specific to the DeepSeek-R1 model. To enable it, set `add_generation_prompt = True` when preparing the benchmark dataset, for example:

```python
input_ids = tokenizer.encode(
    tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True),
    add_special_tokens=False,
)
```

You also need to set `use_relaxed_acceptance_for_thinking: true`, `relaxed_topk: 10`, and `relaxed_delta: 0.6` in `speculative_config`.

## Future Works

- More fusions
- More overlap
- Further optimization of the attention kernel
- More exploration of MTP

## Acknowledgment

Pushing the performance boundaries of DeepSeek R1 for latency-sensitive applications has been a remarkable engineering journey. The optimizations detailed in this post represent an exceptional cross-functional collaboration across the entire AI technology stack, spanning kernel-level optimizations, runtime enhancements, model quantization techniques, algorithmic improvements, and systematic performance analysis and tuning. While we can't individually acknowledge every contributor, we're proud to recognize the dedicated team of engineers whose collective expertise has helped advance the state of the art in TensorRT LLM performance engineering.
Through this collaborative endeavor, we've developed valuable insights into maximizing GPU utilization for large language model inference. We hope that the techniques and best practices shared in this blog will empower the developer community to better leverage NVIDIA GPU capabilities in their mission-critical LLM inference applications.

---

# DeepSeek R1 MTP Implementation and Optimization

by NVIDIA TensorRT LLM team

## Table of Contents

- [DeepSeek R1 MTP Implementation and Optimization](#deepseek-r1-mtp-implementation-and-optimization)
  - [Table of Contents](#table-of-contents)
  - [MTP for inference](#mtp-for-inference)
    - [Background](#background)
    - [MTP Vanilla](#mtp-vanilla)
    - [MTP Eagle](#mtp-eagle)
  - [MTP implementation in TensorRT LLM](#mtp-implementation-in-tensorrt-llm)
    - [Basic Implementation](#basic-implementation)
    - [MTP Modules](#mtp-modules)
    - [Attention for MTP](#attention-for-mtp)
    - [How to run DeepSeek models with MTP](#how-to-run-deepseek-models-with-mtp)
  - [MTP optimization - Relaxed Acceptance](#mtp-optimization---relaxed-acceptance)
    - [Relaxed Acceptance](#relaxed-acceptance)
    - [How to run the DeepSeek-R1 model with Relaxed Acceptance](#how-to-run-the-deepseek-r1-model-with-relaxed-acceptance)
  - [Evaluation](#evaluation)
    - [Achieving speedup with MTP speculative decoding](#achieving-speedup-with-mtp-speculative-decoding)
    - [Accuracy studies for Relaxed Acceptance](#accuracy-studies-for-relaxed-acceptance)
  - [Future Works](#future-works)
    - [Tree-based speculative decoding support](#tree-based-speculative-decoding-support)
    - [Eagle3 support](#eagle3-support)
    - [Fix known issues](#fix-known-issues)
  - [Acknowledgment](#acknowledgment)

TensorRT LLM achieves world-record inference performance for DeepSeek-R1 on NVIDIA Blackwell GPUs, where Multi-Token Prediction (MTP) delivers a significant speedup. In our [previous blog post](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md), we discussed the key optimizations that enable the outstanding inference latency of the DeepSeek-R1 model. This article dives deeper into the implementation and optimization of MTP in TensorRT LLM.

## MTP for inference

Inspired by previous [research work](https://arxiv.org/pdf/2404.19737), MTP is designed to help DeepSeek-V3 training. It adds additional MTP modules at the end of the main model and uses them to predict additional tokens. In this way, MTP extends the prediction scope to multiple future tokens at each position to achieve better model accuracy. During inference, those MTP modules can also be used for speculative decoding to further improve generation latency. In this section, we introduce the MTP speculative decoding algorithm for LLM inference.

### Background

Speculative decoding is a popular technique for faster and more cost-effective LLM inference. It is based on the premise that generating multiple future tokens per iteration (especially in the decode phase, which is less compute-bound) is more efficient than generating a single token at a time. Speculative decoding techniques usually divide the process into a low-cost draft stage and a parallelized verification stage. The draft stage predicts draft tokens using a small model or a subset of layers in the main model, and the verification stage uses the main model to determine how many of these draft tokens to accept, which is far more efficient than generating one token per iteration.
tech_blog2_verify_and_accept

Figure 1. Verification example

Figure 1 shows an example of how to verify and accept draft tokens. Assuming there are a total of 5 draft tokens "ABCDE", we append them to the input token "G" and feed a total of 6 tokens into the main model. After sampling, we get six expected tokens; we then compare the expected tokens with the draft tokens and accept the longest matched prefix. In this example, the tokens "ABC" are matched. Because "H" is predicted by the main model and the corresponding input token "C" is already accepted, "H" is also accepted. In this way, we can accept four tokens in a single iteration. MTP also uses this method to verify and accept draft tokens.
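The acceptance rule above can be sketched as follows (illustrative only; the actual verification in TensorRT LLM is batched and runs on device):

```python
def accept_draft_tokens(draft_tokens, expected_tokens):
    """Greedy verification: accept the longest matched prefix plus one bonus token.

    draft_tokens:    tokens proposed by the draft stage, e.g. ["A", "B", "C", "D", "E"]
    expected_tokens: tokens sampled from the main model for the same positions,
                     e.g. ["A", "B", "C", "H", ...] (one more than the drafts).
    """
    accepted = []
    for draft, expected in zip(draft_tokens, expected_tokens):
        accepted.append(expected)      # the main model's token at this position
        if draft != expected:          # first mismatch ends acceptance
            return accepted
    # All drafts matched: the final expected token is the extra "bonus" token.
    accepted.append(expected_tokens[len(draft_tokens)])
    return accepted

print(accept_draft_tokens(list("ABCDE"), list("ABCH??")))  # -> ['A', 'B', 'C', 'H']
```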
For the draft stage in MTP, there are two different methods, MTP Vanilla and MTP Eagle, which can be used for different inference cases.

### MTP Vanilla

tech_blog2_mtp_vanilla

Figure 2. MTP Vanilla, where $t_i$ is the input token, $d_i$ is the predicted draft token, $K$ is the number of MTP modules, and $h_i^n$ is the hidden state of the $n$-th MTP module. Note that $h_i^0$ means the hidden states of the main model. (Disclaimer: the figures are adapted from the original DeepSeek-V3 tech report.)

The MTP Vanilla method more closely follows MTP training: it sequentially uses different MTP modules to predict multiple draft tokens. This method supports model checkpoints with weights for multiple different MTP modules, and each MTP module has its own KV cache.

Figure 2 illustrates MTP Vanilla inference. In the context phase, assuming there are a total of four input tokens, we obtain the output token $t_5$ and the hidden states after the main model forward. The output token is appended to the input tokens, then we shift out the first token to get tokens $t_2$ to $t_5$ as the input tokens of the first MTP module. The hidden states from the main model are used directly as the input of the first MTP module to predict the first draft token. For the next few MTP modules, we append the newly generated draft token and the hidden state corresponding to the last input token to the input tokens and hidden states, then shift out the first token to prepare the inputs for the next MTP module. In this way, we retain as much information as possible from the main model, which helps the draft layers make more accurate predictions.

The generation phase differs slightly. The predicted token $t_5$ and the draft tokens are used as inputs for the main model. After the main model forward, we perform verification to get the accepted tokens. In this example, assume $j$ draft tokens $d_6$~$d_{j+5}$ are accepted. Then we prepare the MTP module inputs. Unlike the context phase, we prepare the input IDs and hidden states of a total of $K$ tokens before the last accepted token; in this example, the last accepted token is $t_{j+6}$. We then obtain the first draft token after the first MTP module forward. For the subsequent MTP modules, we prepare their inputs in a similar way to the context phase, so all of those MTP modules have the same input sequence length. After predicting all of the draft tokens, we need to evict the keys/values of the rejected draft tokens from the main model's KV cache to ensure the subsequent calculation is correct.

### MTP Eagle
tech_blog2_mtp_eagle

Figure 3. MTP Eagle, using the same notation as Figure 2

MTP Eagle can be viewed as a variant of the [Eagle](https://arxiv.org/pdf/2401.15077) speculative decoding method, though it currently only supports chain decoding. It reuses a single MTP module, invoking it repeatedly to predict draft tokens. MTP Eagle supports model checkpoints with only one MTP module; the official DeepSeek-V3 and DeepSeek-R1 checkpoints contain exactly one. Another difference from MTP Vanilla is the KV cache: in MTP Eagle, the MTP module reuses the same KV cache when predicting multiple draft tokens.

Figure 3 gives an MTP Eagle example. In the context phase, the inputs of the first MTP module forward are the same as in MTP Vanilla. However, for the subsequent MTP module forwards, the first difference is that MTP Eagle uses the same MTP module to predict draft tokens and reuses the same KV cache. Another difference is that we only need to input the token ID and the hidden state of one token: the token is the last predicted draft token, while the hidden state is the corresponding hidden state from the last MTP module forward. In this way, we can predict a total of $K$ draft tokens using only one MTP module.

In the generation phase, the verification stage is the same as in MTP Vanilla. Once we get the accepted tokens, we use all of them along with their corresponding hidden states as inputs for the first MTP module forward. Unlike MTP Vanilla, which needs to store past tokens and hidden states, this approach is much easier to implement. Subsequent MTP module forwards follow the same input preparation method as the context phase. After predicting all draft tokens, we need to evict the key/value pairs of any rejected draft tokens from the main model's KV cache.
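The chain-style drafting loop can be sketched as follows (illustrative Python pseudocode; `mtp_module`, `embed`, and `lm_head` are hypothetical stand-ins rather than the actual TensorRT LLM interfaces):

```python
import torch

def mtp_eagle_draft(mtp_module, embed, lm_head, accepted_tokens, accepted_hidden, k):
    """Predict k draft tokens by reusing one MTP module (chain decoding).

    accepted_tokens: token IDs accepted in the verification stage
    accepted_hidden: the main model's hidden states for those tokens
    """
    drafts = []
    # First forward: all accepted tokens and their hidden states.
    tokens, hidden = accepted_tokens, accepted_hidden
    for _ in range(k):
        out_hidden = mtp_module(embed(tokens), hidden)   # reuses one shared KV cache
        logits = lm_head(out_hidden[-1:])                # last position only
        draft = logits.argmax(dim=-1)
        drafts.append(draft)
        # Subsequent forwards: only the newly drafted token and its hidden state.
        tokens, hidden = draft, out_hidden[-1:]
    return torch.cat(drafts)
```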
## MTP implementation in TensorRT LLM

### Basic Implementation

TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for the DeepSeek-V3 and DeepSeek-R1 models.

tech_blog2_overall_workflow

Figure 4. MTP workflow in TensorRT LLM

Figure 4 shows the overall MTP workflow in TensorRT LLM. Both paths share the same runtime workflow; the differences lie in the MTP module forward passes. In the context phase, there are no draft tokens in the inputs. The TensorRT LLM model engine fetches the input IDs from the requests and feeds them into the model forward to get the next token and the hidden states. Then we prepare the MTP module inputs, and the MTP modules run forward to predict the draft tokens.

The generation workflow is more complicated because we need both the verification and draft stages. The predicted new token and draft tokens are the inputs for the main model. After the main model forward, we sample from the output logits to get the following new tokens, then compare them with the input draft tokens to get the final accepted tokens, which completes the verification stage. We then use the accepted tokens and hidden states to start a new draft stage, which uses the MTP layers to predict new draft tokens for the next iteration. Finally, we need to rewind the KV cache to evict the keys/values corresponding to the rejected tokens.

Except for the KV cache rewind, all of these steps run inside the model engine forward function. In this way, one model engine can support MTP inference, and it becomes easier to keep MTP compatible with other features, such as CUDA Graph and the overlap scheduler. When CUDA Graph is enabled, both the verification and draft stages can be captured in one graph, significantly reducing CPU overhead.

### MTP Modules
tech_blog2_mtp_modules

Figure 5. MTP model architecture

Figure 5 introduces the basic model architecture of [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326), [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047), and the basic [MTP module](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L829) design. Because MTP vanilla needs $K$ input tokens, if the number of accepted tokens is less than the number of input tokens, i.e. $j