# AutoGPTQ
> On Linux and Windows, AutoGPTQ can be installed through pre-built wheels for specific PyTorch versions:
---
# Installation
On Linux and Windows, AutoGPTQ can be installed through pre-built wheels for specific PyTorch versions:
| AutoGPTQ version | CUDA/ROCm version | Installation | Built against PyTorch |
|------------------|-------------------|------------------------------------------------------------------------------------------------------------|-----------------------|
| latest (0.7.1) | CUDA 11.8 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.1+cu118 |
| latest (0.7.1) | CUDA 12.1 | `pip install auto-gptq` | 2.2.1+cu121 |
| latest (0.7.1) | ROCm 5.7 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm571/` | 2.2.1+rocm5.7 |
| 0.7.0 | CUDA 11.8 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.0+cu118 |
| 0.7.0 | CUDA 12.1 | `pip install auto-gptq` | 2.2.0+cu121 |
| 0.7.0 | ROCm 5.7 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm571/` | 2.2.0+rocm5.7 |
| 0.6.0 | CUDA 11.8 | `pip install auto-gptq==0.6.0 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.1.1+cu118 |
| 0.6.0 | CUDA 12.1 | `pip install auto-gptq==0.6.0` | 2.1.1+cu121 |
| 0.6.0 | ROCm 5.6 | `pip install auto-gptq==0.6.0 --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm561/` | 2.1.1+rocm5.6 |
| 0.5.1 | CUDA 11.8 | `pip install auto-gptq==0.5.1 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.1.0+cu118 |
| 0.5.1 | CUDA 12.1 | `pip install auto-gptq==0.5.1` | 2.1.0+cu121 |
| 0.5.1 | ROCm 5.6 | `pip install auto-gptq==0.5.1 --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm561/` | 2.1.0+rocm5.6 |
AutoGPTQ is not available on macOS.
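If you are unsure which row matches your environment, you can inspect the locally installed PyTorch build; the following sketch only relies on standard `torch` attributes:
```python
# Inspect the local PyTorch build to pick the matching wheel from the table above.
import torch

print(torch.__version__)                     # e.g. "2.2.1+cu121" -> use the CUDA 12.1 wheel
print(torch.version.cuda)                    # CUDA version PyTorch was built against, or None
print(getattr(torch.version, "hip", None))   # ROCm/HIP version for AMD builds, or None
```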
---
## News or Update
- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel support.
- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!
- 2023-08-21 - (News) - The Qwen team officially released a 4-bit quantized version of Qwen-7B based on `auto-gptq`, and provided [detailed benchmark results](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization).
- 2023-08-06 - (Update) - Support exllama's q4 CUDA kernel for at least a 1.3x speedup when running inference on int4 quantized models.
- 2023-08-04 - (Update) - Support ROCm so that AMD GPU users can use auto-gptq with CUDA extensions.
- 2023-07-26 - (Update) - An elegant [PPL benchmark script](examples/benchmark/perplexity.py) to get results that can be fairly compared with other libraries such as `llama.cpp`.
- 2023-06-05 - (Update) - Integrate with 🤗 peft to use gptq quantized model to train adapters, support LoRA, AdaLoRA, AdaptionPrompt, etc.
- 2023-05-30 - (Update) - Support downloading/uploading quantized models from/to the 🤗 Hub.
- 2023-05-27 - (Update) - Support quantization and inference for `gpt_bigcode`, `codegen` and `RefineWeb/RefineWebModel` (falcon) model types.
- 2023-05-04 - (Update) - Support using a faster CUDA kernel when `not desc_act or group_size == -1`.
- 2023-04-29 - (Update) - Support loading quantized models with an arbitrary `quantize_config` and `model_basename`.
- 2023-04-28 - (Update) - Support CPU offloading and quantization/inference on multiple devices; support `gpt2` type models.
- 2023-04-26 - (Update) - Using `triton` to speed up inference is now supported.
- 2023-04-25 - (News&Update) - [MOSS](https://github.com/OpenLMLab/MOSS) is an open-source tool-augmented conversational language model from Fudan University; its quantization is now supported in AutoGPTQ.
- 2023-04-23 - (Update) - Support evaluation on multiple (downstream) tasks such as language modeling, text classification and text summarization.
- 2023-04-22 - (News) - qwopqwop200's [AutoGPTQ-triton](https://github.com/qwopqwop200/AutoGPTQ-triton) provides faster inference for quantized models; everyone with access to Triton, give it a try and enjoy!
- 2023-04-20 - (News) - AutoGPTQ is automatically compatible with Stability-AI's newly released `gpt_neox` type model family [StableLM](https://github.com/Stability-AI/StableLM).
- 2023-04-16 - (Update) - Support quantization and inference for `bloom`, `gpt_neox`, `gptj`, `llama` and `opt`.
---
# Quick Start
Welcome to the AutoGPTQ tutorial. In this chapter, you will learn how to quickly install `auto-gptq` from PyPI and the basic usage of this library.
## Quick Installation
Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:
```shell
pip install auto-gptq
```
AutoGPTQ supports using `triton` to speed up inference, but this currently **only supports Linux**. To install with Triton support, use:
```shell
pip install auto-gptq[triton]
```
If you want to try the newly supported `llama` type models without updating 🤗 Transformers to the latest version, use:
```shell
pip install auto-gptq[llama]
```
By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.
To disable building the CUDA extension, you can use the following commands:
For Linux
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
For Windows
```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```
## Basic Usage
*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`.*
The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
### Quantize a pretrained model
To quantize a model, you need to load a pretrained model and tokenizer first, for example:
```python
from transformers import AutoTokenizer
pretrained_model_name = "facebook/opt-125m"
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
```
This will download `opt-125m` from the 🤗 Hub, cache it to local disk, and then load it into **CPU memory**.
*In a later tutorial, you will learn advanced model loading strategies such as CPU offloading and loading a model across multiple devices.*
Then, prepare the examples (a list of dicts with only two keys, `input_ids` and `attention_mask`) that will guide quantization. Here we use only one text to keep the code simple, but keep in mind that the more examples you use, the better (most likely) the quantized model will be.
```python
examples = [
tokenizer(
"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
)
]
```
Once all the examples are prepared, we can start to quantize the pretrained model.
```python
model.quantize(examples)
```
Finally, we can save the quantized model:
```python
quantized_model_dir = "opt-125m-4bit-128g"
model.save_quantized(quantized_model_dir)
```
By default, the saved file type is `.bin`; you can also set `use_safetensors=True` to save a `.safetensors` model file. The base name of the model file saved this way follows the format `gptq_model-{bits}bit-{group_size}g`.
The pretrained model's config and the quantize config will also be saved, with file names `config.json` and `quantize_config.json` respectively.
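For example, saving in safetensors format (using the `use_safetensors` flag mentioned above) looks like this:
```python
# Save the quantized weights as a .safetensors file instead of the default .bin;
# in this example the file will be named gptq_model-4bit-128g.safetensors.
model.save_quantized(quantized_model_dir, use_safetensors=True)
```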
### Load quantized model and do inference
Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.
```python
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device)
```
This will first read and load `quantize_config.json` from the `opt-125m-4bit-128g` directory, then, based on the values of `bits` and `group_size` in it, load the `gptq_model-4bit-128g.bin` model file into the first visible GPU.
Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and do inference.
```python
from transformers import TextGenerationPipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
## Conclusion
Congrats! You have learned how to quickly install `auto-gptq` and integrate it into your code. In the next chapter, you will learn advanced loading strategies for pretrained and quantized models, together with some best practices for different situations.
---
# Advanced Model Loading and Best Practice
Welcome to the AutoGPTQ tutorial. In this chapter, you will learn about advanced model loading and best practices in `auto-gptq`.
## Arguments Introduction
In the previous chapter, you learned how to load a model into CPU memory or a single GPU with the two basic APIs:
- `.from_pretrained`: by default, load the whole pretrained model into CPU.
- `.from_quantized`: by default, `auto_gptq` will automatically find the suitable way to load the quantized model.
  - if there is only a single GPU and the model can fit into it, the whole model will be loaded into that GPU;
  - if there are multiple GPUs and the model can fit into them, the model will be split evenly and loaded across those GPUs;
  - if the model can't fit into the GPU(s), CPU offloading will be used.
However, the default settings above may not meet every user's needs, since some want more control over model loading.
Luckily, AutoGPTQ provides some advanced arguments that users can tweak to manually configure the model loading strategy:
- `low_cpu_mem_usage`: a `bool` argument, defaults to `False`; can be used in both `.from_pretrained` and `.from_quantized`. Enable it when CPU memory is limited (by default the model is initialized in CPU memory) or when you want to load the model faster.
- `max_memory`: an optional `Dict[Union[int, str], str]` type argument; can be used in both `.from_pretrained` and `.from_quantized`.
- `device_map`: an optional `Union[str, Dict[str, Union[int, str]]]` type argument; currently only supported in `.from_quantized`.
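As a quick illustration of the first argument (a minimal sketch reusing the model and `quantize_config` from the Quick Start chapter), `low_cpu_mem_usage` is simply passed as a keyword argument:
```python
# Load the pretrained model while keeping CPU memory usage low;
# the same flag can also be passed to .from_quantized.
model = AutoGPTQForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantize_config,          # the BaseQuantizeConfig from the Quick Start chapter
    low_cpu_mem_usage=True,
)
```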
Before `auto-gptq` existed, many users had already used other popular tools such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) to quantize their models and save them under different names, without the `quantize_config.json` file introduced in the previous chapter.
To address this, two more arguments were introduced in `.from_quantized` so that users can load quantized models with arbitrary names, as shown in the sketch after this list.
- `quantize_config`: an optional `BaseQuantizeConfig` type argument; can be used to match the model file and initialize the model in case `quantize_config.json` is not in the directory where the model is saved.
- `model_basename`: an optional `str` type argument; if specified, it will be used to match the model file instead of the file name format introduced in the previous chapter.
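A minimal sketch of such a load might look like the following; the directory path and base name below are placeholders for wherever your externally quantized checkpoint lives:
```python
# Load a checkpoint quantized by another tool (no quantize_config.json on disk).
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # must match how it was quantized
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/external-gptq-checkpoint",   # placeholder directory
    quantize_config=quantize_config,      # supplied manually
    model_basename="llama-4bit-128g",     # placeholder; matches <model_basename>.bin/.safetensors
    device="cuda:0",
)
```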
## Multiple Devices Model Loading
### max_memory
With this argument, you can specify the maximum amount of memory to use on the CPU and on each GPU.
By specifying the maximum CPU memory used at model loading, you can keep some model weights in CPU memory, move them to the GPU only when they are required, and move them back to the CPU afterwards. This is called "CPU offload", a very useful strategy when there is not enough room for quantization or inference if you keep the whole model in GPU(s).
If you have multiple GPUs, you can also specify the maximum memory used to load the model for each of them separately, and quantization and inference will then be executed across devices.
To understand this better, below are some examples.
```python
max_memory = {0: "20GIB"}
```
In this case, only the first GPU (even if you have more GPUs) will be used to load the model, and an error will be raised if the model requires more than 20GB of memory.
```python
max_memory = {0: "20GIB", 1: "20GIB"}
```
In this case, you can load a model smaller than 40GB onto two GPUs, and the model will be split evenly between them.
```python
max_memory = {0: "10GIB", 1: "30GIB"}
```
In this case, you can also load a model smaller than 40GB onto two GPUs, but the first GPU will hold at most 10GB of weights, and anything beyond that will be loaded into the second GPU.
```python
max_memory = {0: "20GIB", "cpu": "20GIB"}
```
In this case, you can also load a model smaller than 40GB, but the remaining 20GB will be kept in CPU memory and only moved onto the GPU when needed.
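These dictionaries are passed straight through the loading APIs; for example (reusing the quantized model directory from the Quick Start chapter):
```python
# Load the quantized model with an explicit per-device memory budget;
# weights that don't fit within the 20GB GPU budget stay in CPU memory.
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit-128g",
    max_memory={0: "20GIB", "cpu": "20GIB"},
)
```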
### device_map
So far, only `.from_quantized` supports this argument.
You can provide a string to this argument to use a pre-set model loading strategy. Currently, the valid values are `["auto", "balanced", "balanced_low_0", "sequential"]`.
In the simplest case, you can set `device_map='auto'` and let 🤗 Accelerate handle the device map computation. For more details on this argument, you can refer to [this document](https://huggingface.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map).
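For instance, a minimal sketch with the automatic strategy (again reusing the Quick Start model directory):
```python
# Let 🤗 Accelerate compute the device map and place the model automatically.
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit-128g",
    device_map="auto",
)
```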
## Best Practice
### At Quantization
It is always recommended to first consider loading the whole model into GPU(s), since this saves the time spent transferring module weights between CPU and GPU.
However, not everyone has that much GPU memory. Roughly speaking: always specify the maximum CPU memory that will be used to load the model; then, for each GPU, reserve enough memory to hold the example tensors and intermediate calculations of roughly 1\~2 model layers (2\~3 for the first GPU in case CPU offload is used), and use all the remaining memory to load model weights. With this, all you need is some simple math based on the number of GPUs you have, the size of the model weight file(s) and the number of model layers; an illustrative budget follows below.
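As an illustration only (the numbers below are placeholders for a hypothetical setup with two 24GB GPUs, not measured values), the resulting budget might look like this:
```python
# Hypothetical max_memory budget for quantization on two 24GB GPUs with CPU offload:
# leave headroom for the example tensors and per-layer calculations, give the rest to weights.
max_memory = {
    0: "18GIB",      # first GPU: larger reserve (2~3 layers) in case CPU offload is used
    1: "21GIB",      # second GPU: smaller reserve (1~2 layers)
    "cpu": "30GIB",  # maximum CPU memory used for weights not yet moved to a GPU
}
```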
### At Inference
For inference, follow this principle: always use a single GPU if you can, otherwise use multiple GPUs; CPU offload should be the last option to consider.
## Conclusion
Congrats! You have learned the advanced strategies for loading models with `.from_pretrained` and `.from_quantized` in `auto-gptq`, together with some best-practice advice. In the next chapter, you will learn how to quickly customize an AutoGPTQ model and use it for quantization and inference.