# SmoothQuant API Reference

## Core Modules

### smoothquant.smooth

Main module for applying SmoothQuant to models.

#### `smooth_lm(model, scales, alpha=0.5)`

Apply SmoothQuant smoothing to a model.

**Parameters:**

- `model` (nn.Module): Language model to smooth
- `scales` (dict): Activation scales per layer
- `alpha` (float): Smoothing factor in the range [0, 1], default 0.5

**Returns:** None (modifies the model in place)

**Example:**

```python
from smoothquant.smooth import smooth_lm

smooth_lm(model, activation_scales, alpha=0.85)
```

### smoothquant.fake_quant

Provides fake (simulated) quantization functions for testing.

#### `fake_quantize_activation_per_token_asymmetric(activation, scales)`

Simulate asymmetric per-token INT8 activation quantization.

**Parameters:**

- `activation` (torch.Tensor): Activation tensor, shape `[seq_len, hidden_dim]`
- `scales` (torch.Tensor): Per-token scales, shape `[seq_len]`

**Returns:** Fake-quantized activation tensor (quantized and dequantized, same dtype as the input)

**Example:**

```python
from smoothquant.fake_quant import fake_quantize_activation_per_token_asymmetric

quant_act = fake_quantize_activation_per_token_asymmetric(activation, scales)
```

#### `fake_quantize_weight_per_channel_symmetric(weight, scales)`

Simulate symmetric per-channel INT8 weight quantization.

**Parameters:**

- `weight` (torch.Tensor): Weight tensor, shape `[out_features, in_features]`
- `scales` (torch.Tensor): Per-channel scales, shape `[out_features]`

**Returns:** Fake-quantized weight tensor

### smoothquant.opt

Quantized OPT model implementation.

#### `Int8OPTForCausalLM`

OPT model with INT8 quantization.

**Methods:**

- `from_pretrained(model_id)`: Load a pre-quantized model from Hugging Face
- `generate(**kwargs)`: Generate text (standard Transformers API)
- `forward(input_ids, attention_mask, ...)`: Standard forward pass

**Example:**

```python
from smoothquant.opt import Int8OPTForCausalLM

model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
output = model.generate(input_ids=input_ids)
```

### smoothquant.calibration

Calibration and scale computation.

#### `compute_activation_scales(model, data_loader, num_samples=512)`

Compute activation scales for quantization.

**Parameters:**

- `model` (nn.Module): Model to calibrate
- `data_loader` (DataLoader): Calibration data loader
- `num_samples` (int): Number of calibration samples

**Returns:** dict of activation scales per layer

**Example:**

```python
from smoothquant.calibration import compute_activation_scales

scales = compute_activation_scales(model, calib_loader, num_samples=512)
```

### smoothquant.ppl_eval

Perplexity evaluation for quantized models.

#### `evaluate_ppl(model, dataset, scales, alpha=0.5, smooth=True, quantize=True)`

Evaluate the perplexity of a quantized model.

**Parameters:**

- `model` (nn.Module): Model to evaluate
- `dataset` (str): Dataset name (`'wikitext-2'`, `'wikitext-103'`, etc.)
- `scales` (dict): Activation scales
- `alpha` (float): Smoothing factor
- `smooth` (bool): Whether to apply smoothing
- `quantize` (bool): Whether to apply fake quantization

**Returns:** float (perplexity value)
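**Example** (a minimal usage sketch, assuming the signature documented above; `model` and `scales` come from the calibration step):

```python
from smoothquant.ppl_eval import evaluate_ppl

# Smooth with alpha=0.85 and apply fake quantization before measuring perplexity
ppl = evaluate_ppl(model, 'wikitext-2', scales, alpha=0.85,
                   smooth=True, quantize=True)
print(f"Perplexity: {ppl:.3f}")
```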
## Command-line Tools

### generate_act_scales.py

Generate activation channel scales for a model.

**Usage:**

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --output-path <output_path> \
    --num-samples 512 \
    --seq-len 512
```

**Arguments:**

- `--model-name`: Hugging Face model ID or local path
- `--output-path`: Where to save the scales (`.pt` file)
- `--num-samples`: Number of calibration samples (default: 512)
- `--seq-len`: Sequence length (default: 512)
- `--dataset-path`: Path to the calibration dataset
- `--per-gpu-batch-size`: Batch size per GPU (default: 8)

### export_int8_model.py

Export a model quantized with SmoothQuant.

**Usage:**

```bash
python examples/export_int8_model.py \
    --model_path <model_path> \
    --output_path <output_path> \
    --act_scales_path <act_scales_path> \
    --alpha 0.85
```

**Arguments:**

- `--model_path`: Original model (Hugging Face ID or path)
- `--output_path`: Output directory for the quantized model
- `--act_scales_path`: Path to the activation scales file
- `--alpha`: Smoothing factor (default: 0.5)

### ppl_eval.py

Evaluate the perplexity of quantized models.

**Usage:**

```bash
python smoothquant/ppl_eval.py \
    --model_path <model_path> \
    --act_scales_path <act_scales_path> \
    --smooth \
    --alpha 0.85 \
    --quantize \
    --dataset_name wikitext \
    --dataset_config wikitext-2
```

**Arguments:**

- `--model_path`: Model to evaluate
- `--act_scales_path`: Path to the activation scales
- `--smooth`: Apply smoothing (flag)
- `--alpha`: Smoothing factor
- `--quantize`: Apply fake quantization (flag)
- `--dataset_name`: Dataset for evaluation
- `--dataset_config`: Dataset configuration

## Tensor Specifications

### Activation Scales Format

Activation scales are stored as PyTorch tensors in `.pt` files.

**Structure:**

```python
scales = {
    'layer.0': torch.tensor([...]),  # Shape: [hidden_dim] or [num_heads]
    'layer.1': torch.tensor([...]),
    # ...
}
```

### Quantization Parameters

**INT8 Activation Quantization:**

- Range: [-128, 127]
- Zero-point: computed per token
- Scale: per token, computed from data

**INT8 Weight Quantization:**

- Range: [-128, 127]
- Zero-point: 0 (symmetric)
- Scale: per output channel
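To make these specifications concrete, here is an illustrative sketch of the two quantize-dequantize round trips. It is a simplified, self-contained version (computing scales inline) rather than the library's `fake_quant` functions, which take precomputed scales:

```python
import torch

def fake_quant_act_per_token_asymmetric(x: torch.Tensor) -> torch.Tensor:
    """Illustrative per-token asymmetric INT8 round trip (x: [seq_len, hidden_dim])."""
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / 255     # one scale per token
    zero_point = torch.round(-x_min / scale) - 128    # maps x_min to -128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale                   # dequantize for simulation

def fake_quant_weight_per_channel_symmetric(w: torch.Tensor) -> torch.Tensor:
    """Illustrative per-output-channel symmetric INT8 round trip (w: [out, in])."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # zero-point = 0
    q = torch.clamp(torch.round(w / scale), -128, 127)
    return q * scale
```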
## Performance Characteristics

### Computational Complexity

- Calibration: O(n·d), where n = number of calibration samples and d = hidden dimension
- Smoothing: O(l·d), where l = number of layers
- Quantization: O(model size)

### Memory Usage

- Original model: 100%
- After INT8 quantization: 50%
- During the quantization process: 150% (temporary)

### Latency

- Typical 1.5x speedup on NVIDIA GPUs
- Varies by model size and GPU architecture
- Better results at larger batch sizes

## Error Handling

### Common Exceptions

```python
# Missing activation scales
RuntimeError: "scales not found for layer..."

# Incompatible model type
ValueError: "unsupported model architecture"

# Out of memory during calibration
RuntimeError: "CUDA out of memory"
```

## Version Compatibility

- PyTorch: >= 1.12.0
- Transformers: >= 4.36.0
- CUDA: 11.3+ recommended
- Python: 3.8+

---

# SmoothQuant Implementation Guide

## Overview

This guide covers the main workflows for using SmoothQuant to quantize large language models to INT8.

## Workflow 1: Using Pre-quantized Models

The simplest approach is to use the pre-quantized OPT models already available on Hugging Face.

### Load Model

```python
from smoothquant.opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

# Load pre-quantized INT8 model
model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")

# Generate text
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs)
```

### Available Models

- `mit-han-lab/opt-125m-smoothquant`
- `mit-han-lab/opt-1.3b-smoothquant`
- `mit-han-lab/opt-2.7b-smoothquant`
- `mit-han-lab/opt-6.7b-smoothquant`
- `mit-han-lab/opt-13b-smoothquant`
- `mit-han-lab/opt-30b-smoothquant`
- `mit-han-lab/opt-66b-smoothquant`

## Workflow 2: Quantize Your Own Model

To quantize a model not in the pre-quantized collection:

### Step 1: Generate Activation Scales

```bash
python examples/generate_act_scales.py \
    --model-name meta-llama/Llama-2-7b \
    --output-path llama2-7b-scales.pt \
    --num-samples 512 \
    --seq-len 512 \
    --dataset-path data/calibration_data
```

Parameters:

- `--model-name`: Hugging Face model ID or local path
- `--output-path`: Where to save the activation scales
- `--num-samples`: Number of calibration samples (512 recommended)
- `--seq-len`: Sequence length (512 typical)
- `--dataset-path`: Path to the calibration dataset (the Pile is used by default)

### Step 2: Export Quantized Model

```bash
python examples/export_int8_model.py \
    --model_path meta-llama/Llama-2-7b \
    --output_path ./llama2-7b-smoothquant \
    --act_scales_path llama2-7b-scales.pt \
    --alpha 0.85
```

Parameters:

- `--model_path`: Original model (Hugging Face ID or path)
- `--output_path`: Where to save the quantized model
- `--act_scales_path`: Path to the activation scales from Step 1
- `--alpha`: Smoothing factor (0.8-0.9 recommended)

### Step 3: Use Quantized Model

```python
from smoothquant.opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

model = Int8OPTForCausalLM.from_pretrained("./llama2-7b-smoothquant")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

inputs = tokenizer("Explain quantum computing", return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0]))
```

Note that `Int8OPTForCausalLM` is the OPT-specific model class; for Llama-family models, follow the pattern in `examples/smoothquant_llama_demo.ipynb`.

## Workflow 3: Evaluate Quantized Models

Evaluate the perplexity of a quantized model:

```bash
python smoothquant/ppl_eval.py \
    --model_path meta-llama/Llama-2-7b \
    --act_scales_path llama2-7b-scales.pt \
    --smooth \
    --alpha 0.85 \
    --quantize \
    --dataset_name wikitext \
    --dataset_config wikitext-2
```

## Smoothing Algorithm Details

### Mathematical Foundation

SmoothQuant applies a per-channel scaling transformation that leaves the layer output unchanged:

```
X_smoothed = X · diag(s)^(-1)
W_smoothed = diag(s) · W
Y = X_smoothed · W_smoothed = X · W
```

Where:

- `X` = activation tensor
- `W` = weight tensor
- `s` = per-input-channel smoothing factors

Dividing the activations by `s` shrinks their outlier channels; the compensating multiplication of the weights keeps the product mathematically identical.

### Alpha Parameter

The alpha parameter controls how much quantization difficulty is migrated. For input channel `j`:

```
s_j = max(|X_j|)^α / max(|W_j|)^(1-α)
```

- `α = 0.5`: balanced smoothing (default starting point)
- `α = 0.8-0.9`: migrates more difficulty toward the weights (often better for LLMs)
- `α = 1.0`: all quantization difficulty migrated to the weights

### Tuning Alpha

For best results with your model:

1. Start with α = 0.85
2. If accuracy drops by more than 1%: increase α to 0.9
3. If there is no improvement: decrease α to 0.8
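To see the equivalence concretely, here is a minimal numeric check of the smoothing identity and the alpha formula above (tensor names and shapes are illustrative, not the library's API):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)    # activations [tokens, in_channels]
W = torch.randn(16, 8)   # nn.Linear-style weights [out_features, in_features]
alpha = 0.85

# s_j = max|X_j|^alpha / max|W_j|^(1-alpha), computed per input channel j
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=0).pow(1 - alpha)
s = s.clamp(min=1e-5)

Y_ref = X @ W.T                  # original layer output
Y_smooth = (X / s) @ (W * s).T   # smoothed activations x compensated weights
print(torch.allclose(Y_ref, Y_smooth, atol=1e-5))  # True: output is unchanged
```

After smoothing, `X / s` has much flatter per-channel ranges, which is what makes INT8 activation quantization accurate.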
## Performance Optimization

### For Single GPU

Fake quantization provides accurate perplexity estimates without special hardware:

```python
from smoothquant.fake_quant import fake_quantize_activation_per_token_asymmetric

# Simulate INT8 inference on an FP16 model
quantized_output = fake_quantize_activation_per_token_asymmetric(
    activation, scales
)
```

### For Multi-GPU Inference

Use FasterTransformer for production:

```bash
# Compile FasterTransformer with INT8 support
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer
# Build with the -DENABLE_INT8=ON flag
```

Then use the quantized model with the FasterTransformer backend for up to a 1.56x speedup.

### Memory Savings

- FP16 model: 1x (baseline)
- INT8 model: 0.5x (50% reduction)
- With KV-cache optimization: an additional 0.5x for attention

## Pre-computed Activation Scales

Pre-computed scales are available at: https://huggingface.co/mit-han-lab/smoothquant-scales

This includes scales for:

- OPT (125M to 66B)
- BLOOM
- Llama-1/2/3
- Falcon
- Mistral
- Mixtral

Use them with your model:

```python
import torch
from smoothquant.smooth import smooth_lm

# Load pre-computed scales
scales = torch.load("path/to/scales.pt")

# Apply to your model
smooth_lm(model, scales, alpha=0.85)
```

## Troubleshooting

### Issue: Low accuracy after quantization

**Solution**: Try different alpha values (a Python version of this sweep appears at the end of this section):

```bash
for alpha in 0.75 0.80 0.85 0.90; do
    python smoothquant/ppl_eval.py \
        --model_path <model_path> \
        --act_scales_path <act_scales_path> \
        --alpha $alpha \
        --smooth \
        --quantize
done
```

### Issue: CUDA out of memory

**Solution**: Reduce the sample count and per-GPU batch size during calibration:

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --num-samples 256 \
    --per-gpu-batch-size 1
```

### Issue: torch-int not available

**Solution**: Install the required dependencies:

```bash
pip install torch-int
# If the build fails, ensure the CUDA toolkit matches your PyTorch installation
```
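As a programmatic alternative to the shell-based alpha sweep in the first troubleshooting item, here is a sketch using the `evaluate_ppl` helper documented in the API reference (assuming that signature; the model is copied per trial on the assumption that smoothing modifies it in place):

```python
import copy
from smoothquant.ppl_eval import evaluate_ppl  # signature per the API reference

best_alpha, best_ppl = None, float("inf")
for alpha in (0.75, 0.80, 0.85, 0.90):
    # Smoothing mutates the model, so evaluate a fresh copy each time
    trial = copy.deepcopy(model)
    ppl = evaluate_ppl(trial, "wikitext-2", scales, alpha=alpha,
                       smooth=True, quantize=True)
    print(f"alpha={alpha:.2f}  ppl={ppl:.3f}")
    if ppl < best_ppl:
        best_alpha, best_ppl = alpha, ppl
print(f"Best alpha: {best_alpha} (perplexity {best_ppl:.3f})")
```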
## Integration Examples

### With Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from smoothquant.smooth import smooth_lm

# Load the FP16 model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Apply smoothing with pre-computed activation scales
# (see examples/export_int8_model.py for the full implementation)
scales = torch.load("llama2-7b-scales.pt")
smooth_lm(model, scales, alpha=0.85)
```

### With vLLM

SmoothQuant-quantized models are compatible with vLLM for efficient serving:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model mit-han-lab/opt-30b-smoothquant \
    --quantization smoothquant
```

### With TensorRT-LLM

For production inference with maximum performance:

```bash
# Build with SmoothQuant support
trtllm-build \
    --checkpoint_dir ./quantized_model \
    --output_dir ./trt_engine \
    --quantization smoothquant
```

## References

- Paper: https://arxiv.org/abs/2211.10438
- GitHub: https://github.com/mit-han-lab/smoothquant
- Hugging Face: https://huggingface.co/mit-han-lab

---

# SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

**Source:** https://github.com/mit-han-lab/smoothquant

## Overview

SmoothQuant is a training-free, accuracy-preserving post-training quantization (PTQ) solution for large language models that quantizes both activations and weights to INT8 (W8A8). Developed by MIT and NVIDIA researchers, it enables efficient inference for LLMs with up to 175B parameters.

### Key Features

- **Training-free**: No fine-tuning required
- **Accuracy-preserving**: Negligible loss in model accuracy
- **W8A8 quantization**: Both weights and activations quantized to INT8
- **Hardware-efficient**: 1.56x speedup and 2x memory reduction
- **General-purpose**: Works with OPT, BLOOM, GLM, LLaMA, Falcon, Mistral, and Mixtral models

### Technical Innovation

SmoothQuant migrates quantization difficulty from activations to weights through a mathematically equivalent transformation, smoothing out activation outliers so that both weights and activations become easy to quantize.

### Integration Points

- NVIDIA TensorRT-LLM (INT8 W8A8 support)
- Amazon SageMaker (LLM inference optimization)
- Intel Neural Compressor (quantization toolkit)
- Microsoft ONNX Runtime (inference examples)
- AMD Instinct MI300X (INT8 GEMM via Composable Kernel)

## Performance Results

### Supported Models with W8A8 Quantization

| Model | Method | PPL | Alpha |
|-------|--------|-----|-------|
| Llama-2-7B | FP16 | 5.474 | - |
| | SQ W8A8 | 5.515 | 0.85 |
| Llama-2-13B | FP16 | 4.950 | - |
| | SQ W8A8 | 4.929 | 0.85 |
| Llama-2-70B | FP16 | 3.320 | - |
| | SQ W8A8 | 3.359 | 0.9 |
| Llama-3-8B | FP16 | 6.138 | - |
| | SQ W8A8 | 6.258 | 0.85 |
| Mistral-7B | FP16 | 5.253 | - |
| | SQ W8A8 | 5.277 | 0.8 |
| Mixtral-8x7B | FP16 | 3.842 | - |
| | SQ W8A8 | 3.893 | 0.8 |
| Falcon-7B | FP16 | 6.590 | - |
| | SQ W8A8 | 6.629 | 0.6 |
| Falcon-40B | FP16 | 5.228 | - |
| | SQ W8A8 | 5.255 | 0.7 |

## Installation

```bash
conda create -n smoothquant python=3.8
conda activate smoothquant
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers==4.36.0 accelerate datasets zstandard
python setup.py install
```

## Quick Start

### Load Pre-quantized INT8 Model

```python
from smoothquant.opt import Int8OPTForCausalLM

model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
```

Available pre-quantized models: `opt-125m`, `opt-1.3b`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-30b`, `opt-66b`

### Generate Activation Scales

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --output-path <output_path> \
    --num-samples <num_samples> \
    --seq-len <seq_len> \
    --dataset-path <dataset_path>
```

### Smooth and Quantize Models

```bash
python examples/export_int8_model.py \
    --model_path <model_path> \
    --output_path <output_path> \
    --act_scales_path <act_scales_path>
```

### Evaluate Perplexity

```bash
python smoothquant/ppl_eval.py \
    --model_path <model_path> \
    --act_scales_path <act_scales_path> \
    --smooth \
    --alpha <alpha> \
    --quantize
```

## Core Components

### Main Modules

- **`smoothquant/smooth.py`**: Core smoothing algorithm implementation
- **`smoothquant/fake_quant.py`**: FP16 fake-quantization simulation
- **`smoothquant/opt.py`**: Quantized OPT model class with INT8 linear layers
- **`smoothquant/calibration.py`**: Calibration and scale computation
- **`smoothquant/ppl_eval.py`**: Perplexity evaluation for quantized models

A sketch of how these modules compose follows the examples list below.

### Examples

- **`examples/smoothquant_opt_demo.ipynb`**: OPT-13B W8A8 fake-quantization demo
- **`examples/smoothquant_opt_real_int8_demo.ipynb`**: OPT-30B real INT8 inference on A100
- **`examples/smoothquant_llama_demo.ipynb`**: LLaMA model quantization demo
- **`examples/generate_act_scales.py`**: Script to compute activation channel scales
- **`examples/export_int8_model.py`**: Export quantized INT8 models
- **`examples/ppl_eval.sh`**: Batch evaluation script for multiple models
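A sketch of the calibrate-smooth-evaluate pipeline built from the main modules. Function signatures follow the API reference above; `model` and `calib_loader` are assumed to be a loaded model and a calibration `DataLoader`:

```python
import torch
from smoothquant.calibration import compute_activation_scales
from smoothquant.smooth import smooth_lm
from smoothquant.ppl_eval import evaluate_ppl

# 1. Calibrate: collect per-channel activation statistics
scales = compute_activation_scales(model, calib_loader, num_samples=512)
torch.save(scales, "act_scales.pt")

# 2. Smooth: migrate quantization difficulty into the weights (in place)
smooth_lm(model, scales, alpha=0.85)

# 3. Evaluate with fake quantization (smooth=False: the model is already smoothed)
ppl = evaluate_ppl(model, "wikitext-2", scales, alpha=0.85,
                   smooth=False, quantize=True)
print(f"W8A8 perplexity: {ppl:.3f}")
```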
## Hardware Requirements

### For Real INT8 Inference

- NVIDIA GPU with INT8 support and the torch-int library
- CUTLASS INT8 GEMM kernels wrapped as PyTorch modules

### For Large Models (Multi-GPU)

- Recommended: use the FasterTransformer backend for distributed inference
- OPT-175B: 4 GPUs with INT8 vs. 8 GPUs with FP16

## Activation Channel Scales

Pre-computed activation scales are available for:

- OPT (all sizes)
- BLOOM
- Llama (1/2/3)
- Falcon
- Mistral
- Mixtral

Scales are computed with 512 random sentences from the Pile validation set.

## Research Paper

**SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models**

- **Authors**: Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, Song Han
- **Conference**: ICML 2023
- **Paper**: https://arxiv.org/abs/2211.10438
- **Slides**: see `assets/SmoothQuant.pdf` in the repository

## Citation

```bibtex
@InProceedings{xiao2023smoothquant,
    title = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},
    author = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},
    booktitle = {Proceedings of the 40th International Conference on Machine Learning},
    year = {2023}
}
```

## References

- **GitHub Repository**: https://github.com/mit-han-lab/smoothquant
- **Hugging Face Models**: https://huggingface.co/mit-han-lab/
- **Paper**: https://arxiv.org/abs/2211.10438
- **torch-int**: https://github.com/Guangxuan-Xiao/torch-int
- **CUTLASS**: https://github.com/NVIDIA/cutlass
- **FasterTransformer**: https://github.com/NVIDIA/FasterTransformer