# LMDeploy

## Pages

- [inference pipeline](api-pipeline.md): inference pipeline
- [On Other Platforms](get-started.md): On Other Platforms
- [Welcome to LMDeploy's tutorials!](index.md): Welcome to LMDeploy's tutorials!
- [Vision-Language Models](multi-modal.md): Vision-Language Models
- [Customized chat template](advance-chat-template.md): The effect of the applied chat template can be observed by setting the log level to `INFO`.
- [Context Parallel](advance-context-parallel.md): When the memory on a single GPU is insufficient to deploy a model, it is often deployed using tensor parallelism (TP)...
- [How to debug Turbomind](advance-debug-turbomind.md): Turbomind is implemented in C++, which is not as easy to debug as Python. This document provides basic methods for de...
- [Context length extrapolation](advance-long-context.md): Long text extrapolation refers to the ability of an LLM to handle data longer than the training text during inference. T...
- [Production Metrics](advance-metrics.md): LMDeploy exposes a set of metrics via Prometheus and provides visualization via Grafana.
- [PyTorchEngine Multi-Node Deployment Guide](advance-pytorch-multinodes.md): To support larger-scale model deployment requirements, PyTorchEngine provides multi-node deployment support. Below ar...
- [PyTorchEngine Multithread](advance-pytorch-multithread.md): We have removed `thread_safe` mode from PytorchEngine since [PR2907](https://github.com/InternLM/lmdeploy/pull/2907)....
- [lmdeploy.pytorch New Model Support](advance-pytorch-new-model.md): lmdeploy.pytorch is designed to simplify the support for new models and the development of prototypes. Users can adap...
- [PyTorchEngine Profiling](advance-pytorch-profiling.md): We provide multiple profilers to analyze the performance of PyTorchEngine.
- [Speculative Decoding](advance-spec-decoding.md): Speculative decoding is an optimization technique that introduces a lightweight draft model to propose multiple next t...
- [Structured output](advance-structed-output.md): Structured output, also known as guided decoding, forces the model to generate text that exactly matches a user-suppl...
- [Update Weights](advance-update-weights.md): LMDeploy supports updating model weights online for scenarios such as RL training. Here are the steps to do so.
- [TurboMind Benchmark on A100](benchmark-a100-fp16.md): All the following results are tested on A100-80G (x8) with CUDA 11.8.
- [Benchmark](benchmark-benchmark.md): Please install the lmdeploy precompiled package and download the script and the test dataset:
- [Model Evaluation Guide](benchmark-evaluate-with-opencompass.md): This document describes how to evaluate a model's capabilities on academic datasets using OpenCompass and LMDeploy. T...
- [Multi-Modal Model Evaluation Guide](benchmark-evaluate-with-vlmevalkit.md): This document describes how to evaluate multi-modal models' capabilities using VLMEvalKit and LMDeploy.
- [FAQ](faq.md): There is probably a cached mmengine on your local host. Try installing its latest version.
- [Get Started with Huawei Ascend](get-started-ascend-get-started.md): We currently support running lmdeploy on **Atlas 800T A3, Atlas 800T A2 and Atlas 300I Duo**.
- [Cambricon](get-started-camb-get-started.md): The usage of lmdeploy on a Cambricon device is almost the same as its usage on CUDA with PytorchEngine in lmdeploy.
- [Quick Start](get-started-get-started.md): This tutorial shows the usage of LMDeploy on the CUDA platform:
- [Installation](get-started-installation.md): LMDeploy is a Python library for compressing, deploying, and serving Large Language Models (LLMs) and Vision-Language ...
- [MetaX-tech](get-started-maca-get-started.md): The usage of lmdeploy on a MetaX-tech device is almost the same as its usage on CUDA with PytorchEngine in lmdeploy.
- [Load huggingface model directly](inference-load-hf.md): Starting from v0.1.0, Turbomind adds the ability to pre-process the model parameters on-the-fly while loading them fr...
- [Architecture of lmdeploy.pytorch](inference-pytorch.md): `lmdeploy.pytorch` is an inference engine in LMDeploy that offers a developer-friendly framework to users interested ...
- [Architecture of TurboMind](inference-turbomind.md): TurboMind is an inference engine that supports high-throughput inference for conversational LLMs. It's based on NVIDI...
- [TurboMind Config](inference-turbomind-config.md): TurboMind is one of the inference engines of LMDeploy. When using it to do model inference, you need to convert the i...
- [OpenAI Compatible Server](llm-api-server.md): This article primarily discusses the deployment of a single LLM across multiple GPUs on a single node, providin... (see the client sketch after this list)
- [Serving LoRA](llm-api-server-lora.md): LoRA is currently only supported by the PyTorch backend. Its deployment process is similar to that of other models, a...
- [Reasoning Outputs](llm-api-server-reasoning.md): For models that support reasoning capabilities, such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)...
- [Tools Calling](llm-api-server-tools.md): LMDeploy supports tools for InternLM2, InternLM2.5, llama3.1 and Qwen2.5 models. Please use `--tool-call-parser` to s...
- [codellama](llm-codellama.md): [codellama](https://github.com/facebookresearch/codellama) features enhanced coding capabilities. It can generate cod...
- [Offline Inference Pipeline](llm-pipeline.md): In this tutorial, we will present a list of examples to introduce the usage of `lmdeploy.pipeline`. A minimal usage sketch appears after this list.
- [Request Distributor Server](llm-proxy-server.md): The request distributor service can parallelize multiple api_server services. Users only need to access the proxy URL...
- [OpenAI Compatible Server](multi-modal-api-server-vl.md): This article primarily discusses the deployment of a single large vision-language model across multiple GPUs on a sin...
- [CogVLM](multi-modal-cogvlm.md): CogVLM is a powerful open-source visual language model (VLM). LMDeploy supports CogVLM-17B models like [THUDM/cogvlm-...
- [DeepSeek-VL2](multi-modal-deepseek-vl2.md): DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves...
- [Gemma3](multi-modal-gemma3.md): Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technolo...
- [InternVL](multi-modal-internvl.md): LMDeploy supports the following InternVL series of models, which are detailed in the table below:
- [LLaVA](multi-modal-llava.md): LMDeploy supports the following llava series of models, which are detailed in the table below:
- [MiniCPM-V](multi-modal-minicpmv.md): LMDeploy supports the following MiniCPM-V series of models, which are detailed in the table below:
- [Mllama](multi-modal-mllama.md): [Llama3.2-VL](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) is a family of large l...
- [Molmo](multi-modal-molmo.md): LMDeploy supports the following molmo series of models, which are detailed in the table below:
- [Phi-3 Vision](multi-modal-phi3.md): [Phi-3](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) is a family of small language an...
- [Qwen2.5-VL](multi-modal-qwen2-5-vl.md): LMDeploy supports the following Qwen-VL series of models, which are detailed in the table below:
- [Qwen2-VL](multi-modal-qwen2-vl.md): LMDeploy supports the following Qwen-VL series of models, which are detailed in the table below:
- [Offline Inference Pipeline](multi-modal-vl-pipeline.md): LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipe...
- [InternLM-XComposer-2.5](multi-modal-xcomposer2d5.md): [InternLM-XComposer-2.5](https://github.com/InternLM/InternLM-XComposer) excels in various text-image comprehension a...
- [INT4/INT8 KV Cache](quantization-kv-quant.md): Since v0.4.0, LMDeploy has supported **online** key-value (kv) cache quantization with int4 and int8 numerical precis... (see the `quant_policy` sketch after this list)
- [AWQ/GPTQ](quantization-w4a16.md): LMDeploy TurboMind engine supports the inference of 4-bit quantized models that are quantized both by [AWQ](https://ar...
- [SmoothQuant](quantization-w8a8.md): LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For G...
- [Reward Models](supported-models-reward-models.md): LMDeploy supports reward models, which are detailed in the table below:
- [Supported Models](supported-models-supported-models.md): The following tables detail the models supported by LMDeploy's TurboMind engine and PyTorch engine across different p...
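
The [Offline Inference Pipeline](llm-pipeline.md) pages revolve around `lmdeploy.pipeline`. A minimal sketch of its usage, assuming lmdeploy is installed and using `internlm/internlm2_5-7b-chat` as a stand-in model ID (any supported model works):

```python
# Minimal offline inference with lmdeploy.pipeline.
# The model ID below is an assumption; substitute any supported model.
from lmdeploy import GenerationConfig, pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8)
responses = pipe(['Hi, please introduce yourself', 'Shanghai is'],
                 gen_config=gen_config)
for resp in responses:
    print(resp.text)
```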
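The [OpenAI Compatible Server](llm-api-server.md) pages serve models via `lmdeploy serve api_server`. A hedged sketch of querying such a server with the official `openai` client, assuming a server was already launched on LMDeploy's default port 23333 with the same stand-in model:

```python
# Query an LMDeploy OpenAI-compatible server.
# Assumes a server is already running, e.g.:
#   lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
from openai import OpenAI

client = OpenAI(api_key='none',  # any key works unless one is configured
                base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id  # name of the served model
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hello! How are you?'}],
    temperature=0.8)
print(response.choices[0].message.content)
```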
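[INT4/INT8 KV Cache](quantization-kv-quant.md) enables online kv-cache quantization through the engine config's `quant_policy` field. A minimal sketch, assuming `quant_policy=8` selects int8 and `quant_policy=4` selects int4, again with a stand-in model ID:

```python
# Enable online int8 kv-cache quantization via quant_policy.
# quant_policy=8 -> int8 kv cache; quant_policy=4 -> int4 kv cache.
from lmdeploy import TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=engine_config)
print(pipe(['Hi, please introduce yourself'])[0].text)
```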