# SmoothQuant API Reference

## Core Modules

### smoothquant.smooth

Main module for applying SmoothQuant to models.

#### `smooth_lm(model, scales, alpha=0.5)`

Apply SmoothQuant smoothing to a model.

**Parameters:**

- `model` (nn.Module): Language model to smooth
- `scales` (dict): Activation scales per layer
- `alpha` (float): Smoothing factor in the range [0, 1], default 0.5

**Returns:** None (modifies the model in place)

**Example:**

```python
from smoothquant.smooth import smooth_lm

smooth_lm(model, activation_scales, alpha=0.85)
```

### smoothquant.fake_quant

Provides fake (simulated) quantization functions for testing.

#### `fake_quantize_activation_per_token_asymmetric(activation, scales)`

Simulate asymmetric per-token INT8 activation quantization.

**Parameters:**

- `activation` (torch.Tensor): Activation tensor, shape `[seq_len, hidden_dim]`
- `scales` (torch.Tensor): Per-token scales, shape `[seq_len]`

**Returns:** Fake-quantized activation tensor (quantized and dequantized, same dtype as the input)

**Example:**

```python
from smoothquant.fake_quant import fake_quantize_activation_per_token_asymmetric

quant_act = fake_quantize_activation_per_token_asymmetric(activation, scales)
```

#### `fake_quantize_weight_per_channel_symmetric(weight, scales)`

Simulate symmetric per-channel INT8 weight quantization.

**Parameters:**

- `weight` (torch.Tensor): Weight tensor, shape `[out_features, in_features]`
- `scales` (torch.Tensor): Per-channel scales, shape `[out_features]`

**Returns:** Fake-quantized weight tensor

### smoothquant.opt

Quantized OPT model implementation.

#### `Int8OPTForCausalLM`

OPT model with INT8 quantization.

**Methods:**

- `from_pretrained(model_id)`: Load a pre-quantized model from Hugging Face
- `generate(**kwargs)`: Generate text (standard Transformers API)
- `forward(input_ids, attention_mask, ...)`: Standard forward pass

**Example:**

```python
from smoothquant.opt import Int8OPTForCausalLM

model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
output = model.generate(input_ids=input_ids)
```

### smoothquant.calibration

Calibration and scale computation.

#### `compute_activation_scales(model, data_loader, num_samples=512)`

Compute activation scales for quantization.

**Parameters:**

- `model` (nn.Module): Model to calibrate
- `data_loader` (DataLoader): Calibration data loader
- `num_samples` (int): Number of calibration samples

**Returns:** dict of activation scales per layer

**Example:**

```python
from smoothquant.calibration import compute_activation_scales

scales = compute_activation_scales(model, calib_loader, num_samples=512)
```

### smoothquant.ppl_eval

Perplexity evaluation for quantized models.

#### `evaluate_ppl(model, dataset, scales, alpha=0.5, smooth=True, quantize=True)`

Evaluate the perplexity of a quantized model.

**Parameters:**

- `model` (nn.Module): Model to evaluate
- `dataset` (str): Dataset name (`'wikitext-2'`, `'wikitext-103'`, etc.)
- `scales` (dict): Activation scales
- `alpha` (float): Smoothing factor
- `smooth` (bool): Whether to apply smoothing
- `quantize` (bool): Whether to apply fake quantization

**Returns:** float (perplexity value)
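**Example** (a minimal usage sketch, assuming the signature documented above; `model` and `scales` come from the calibration step):

```python
from smoothquant.ppl_eval import evaluate_ppl

# Smooth with alpha=0.85 and apply fake quantization before measuring perplexity
ppl = evaluate_ppl(model, 'wikitext-2', scales, alpha=0.85,
                   smooth=True, quantize=True)
print(f"Perplexity: {ppl:.3f}")
```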
## Command-line Tools

### generate_act_scales.py

Generate activation channel scales for a model.

**Usage:**

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --output-path <output_path> \
    --num-samples 512 \
    --seq-len 512
```

**Arguments:**

- `--model-name`: Hugging Face model ID or local path
- `--output-path`: Where to save the scales (`.pt` file)
- `--num-samples`: Number of calibration samples (default: 512)
- `--seq-len`: Sequence length (default: 512)
- `--dataset-path`: Path to the calibration dataset
- `--per-gpu-batch-size`: Batch size per GPU (default: 8)

### export_int8_model.py

Export a model quantized with SmoothQuant.

**Usage:**

```bash
python examples/export_int8_model.py \
    --model_path <model_path> \
    --output_path <output_path> \
    --act_scales_path <act_scales_path> \
    --alpha 0.85
```

**Arguments:**

- `--model_path`: Original model (Hugging Face ID or path)
- `--output_path`: Output directory for the quantized model
- `--act_scales_path`: Path to the activation scales file
- `--alpha`: Smoothing factor (default: 0.5)

### ppl_eval.py

Evaluate the perplexity of quantized models.

**Usage:**

```bash
python smoothquant/ppl_eval.py \
    --model_path <model_path> \
    --act_scales_path <act_scales_path> \
    --smooth \
    --alpha 0.85 \
    --quantize \
    --dataset_name wikitext \
    --dataset_config wikitext-2
```

**Arguments:**

- `--model_path`: Model to evaluate
- `--act_scales_path`: Path to the activation scales
- `--smooth`: Apply smoothing (flag)
- `--alpha`: Smoothing factor
- `--quantize`: Apply fake quantization (flag)
- `--dataset_name`: Dataset for evaluation
- `--dataset_config`: Dataset configuration

## Tensor Specifications

### Activation Scales Format

Activation scales are stored as PyTorch tensors in `.pt` files.

**Structure:**

```python
scales = {
    'layer.0': torch.tensor([...]),  # Shape: [hidden_dim] or [num_heads]
    'layer.1': torch.tensor([...]),
    # ...
}
```

### Quantization Parameters

**INT8 Activation Quantization:**

- Range: [-128, 127]
- Zero-point: computed per token
- Scale: per token, computed from data

**INT8 Weight Quantization:**

- Range: [-128, 127]
- Zero-point: 0 (symmetric)
- Scale: per output channel
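To make these specifications concrete, here is an illustrative sketch of the two quantize-dequantize round trips. It is a simplified, self-contained version (computing scales inline) rather than the library's `fake_quant` functions, which take precomputed scales:

```python
import torch

def fake_quant_act_per_token_asymmetric(x: torch.Tensor) -> torch.Tensor:
    """Illustrative per-token asymmetric INT8 round trip (x: [seq_len, hidden_dim])."""
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / 255     # one scale per token
    zero_point = torch.round(-x_min / scale) - 128    # maps x_min to -128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale                   # dequantize for simulation

def fake_quant_weight_per_channel_symmetric(w: torch.Tensor) -> torch.Tensor:
    """Illustrative per-output-channel symmetric INT8 round trip (w: [out, in])."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # zero-point = 0
    q = torch.clamp(torch.round(w / scale), -128, 127)
    return q * scale
```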
## Performance Characteristics

### Computational Complexity

- Calibration: O(n·d), where n = number of calibration samples and d = hidden dimension
- Smoothing: O(l·d), where l = number of layers
- Quantization: O(model size)

### Memory Usage

- Original model: 100%
- After INT8 quantization: 50%
- During the quantization process: 150% (temporary)

### Latency

- Typical 1.5x speedup on NVIDIA GPUs
- Varies by model size and GPU architecture
- Better results at larger batch sizes

## Error Handling

### Common Exceptions

```python
# Missing activation scales
RuntimeError: "scales not found for layer..."

# Incompatible model type
ValueError: "unsupported model architecture"

# Out of memory during calibration
RuntimeError: "CUDA out of memory"
```

## Version Compatibility

- PyTorch: >= 1.12.0
- Transformers: >= 4.36.0
- CUDA: 11.3+ recommended
- Python: 3.8+

---

# SmoothQuant Implementation Guide

## Overview

This guide covers the main workflows for using SmoothQuant to quantize large language models to INT8.

## Workflow 1: Using Pre-quantized Models

The simplest approach is to use the pre-quantized OPT models already available on Hugging Face.

### Load Model

```python
from smoothquant.opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

# Load pre-quantized INT8 model
model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")

# Generate text
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs)
```

### Available Models

- `mit-han-lab/opt-125m-smoothquant`
- `mit-han-lab/opt-1.3b-smoothquant`
- `mit-han-lab/opt-2.7b-smoothquant`
- `mit-han-lab/opt-6.7b-smoothquant`
- `mit-han-lab/opt-13b-smoothquant`
- `mit-han-lab/opt-30b-smoothquant`
- `mit-han-lab/opt-66b-smoothquant`

## Workflow 2: Quantize Your Own Model

To quantize a model not in the pre-quantized collection:

### Step 1: Generate Activation Scales

```bash
python examples/generate_act_scales.py \
    --model-name meta-llama/Llama-2-7b \
    --output-path llama2-7b-scales.pt \
    --num-samples 512 \
    --seq-len 512 \
    --dataset-path data/calibration_data
```

Parameters:

- `--model-name`: Hugging Face model ID or local path
- `--output-path`: Where to save the activation scales
- `--num-samples`: Number of calibration samples (512 recommended)
- `--seq-len`: Sequence length (512 typical)
- `--dataset-path`: Path to the calibration dataset (the Pile is used by default)

### Step 2: Export Quantized Model

```bash
python examples/export_int8_model.py \
    --model_path meta-llama/Llama-2-7b \
    --output_path ./llama2-7b-smoothquant \
    --act_scales_path llama2-7b-scales.pt \
    --alpha 0.85
```

Parameters:

- `--model_path`: Original model (Hugging Face ID or path)
- `--output_path`: Where to save the quantized model
- `--act_scales_path`: Path to the activation scales from Step 1
- `--alpha`: Smoothing factor (0.8-0.9 recommended)

### Step 3: Use Quantized Model

```python
from smoothquant.opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

model = Int8OPTForCausalLM.from_pretrained("./llama2-7b-smoothquant")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

inputs = tokenizer("Explain quantum computing", return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0]))
```

Note that `Int8OPTForCausalLM` is the OPT-specific model class; for Llama-family models, follow the pattern in `examples/smoothquant_llama_demo.ipynb`.

## Workflow 3: Evaluate Quantized Models

Evaluate the perplexity of a quantized model:

```bash
python smoothquant/ppl_eval.py \
    --model_path meta-llama/Llama-2-7b \
    --act_scales_path llama2-7b-scales.pt \
    --smooth \
    --alpha 0.85 \
    --quantize \
    --dataset_name wikitext \
    --dataset_config wikitext-2
```

## Smoothing Algorithm Details

### Mathematical Foundation

SmoothQuant applies a per-channel scaling transformation that leaves the layer output unchanged:

```
X_smoothed = X · diag(s)^(-1)
W_smoothed = diag(s) · W
Y = X_smoothed · W_smoothed = X · W
```

Where:

- `X` = activation tensor
- `W` = weight tensor
- `s` = per-input-channel smoothing factors

Dividing the activations by `s` shrinks their outlier channels; the compensating multiplication of the weights keeps the product mathematically identical.

### Alpha Parameter

The alpha parameter controls how much quantization difficulty is migrated. For input channel `j`:

```
s_j = max(|X_j|)^α / max(|W_j|)^(1-α)
```

- `α = 0.5`: balanced smoothing (default starting point)
- `α = 0.8-0.9`: migrates more difficulty toward the weights (often better for LLMs)
- `α = 1.0`: all quantization difficulty migrated to the weights

### Tuning Alpha

For best results with your model:

1. Start with α = 0.85
2. If accuracy drops by more than 1%: increase α to 0.9
3. If there is no improvement: decrease α to 0.8
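To see the equivalence concretely, here is a minimal numeric check of the smoothing identity and the alpha formula above (tensor names and shapes are illustrative, not the library's API):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)    # activations [tokens, in_channels]
W = torch.randn(16, 8)   # nn.Linear-style weights [out_features, in_features]
alpha = 0.85

# s_j = max|X_j|^alpha / max|W_j|^(1-alpha), computed per input channel j
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=0).pow(1 - alpha)
s = s.clamp(min=1e-5)

Y_ref = X @ W.T                  # original layer output
Y_smooth = (X / s) @ (W * s).T   # smoothed activations x compensated weights
print(torch.allclose(Y_ref, Y_smooth, atol=1e-5))  # True: output is unchanged
```

After smoothing, `X / s` has much flatter per-channel ranges, which is what makes INT8 activation quantization accurate.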
## Performance Optimization

### For Single GPU

Fake quantization provides accurate perplexity estimates without special hardware:

```python
from smoothquant.fake_quant import fake_quantize_activation_per_token_asymmetric

# Simulate INT8 inference on an FP16 model
quantized_output = fake_quantize_activation_per_token_asymmetric(
    activation, scales
)
```

### For Multi-GPU Inference

Use FasterTransformer for production:

```bash
# Compile FasterTransformer with INT8 support
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer
# Build with the -DENABLE_INT8=ON flag
```

Then use the quantized model with the FasterTransformer backend for up to a 1.56x speedup.

### Memory Savings

- FP16 model: 1x (baseline)
- INT8 model: 0.5x (50% reduction)
- With KV-cache optimization: an additional 0.5x for attention

## Pre-computed Activation Scales

Pre-computed scales are available at: https://huggingface.co/mit-han-lab/smoothquant-scales

This includes scales for:

- OPT (125M to 66B)
- BLOOM
- Llama-1/2/3
- Falcon
- Mistral
- Mixtral

Use them with your model:

```python
import torch
from smoothquant.smooth import smooth_lm

# Load pre-computed scales
scales = torch.load("path/to/scales.pt")

# Apply to your model
smooth_lm(model, scales, alpha=0.85)
```

## Troubleshooting

### Issue: Low accuracy after quantization

**Solution**: Try different alpha values (a Python version of this sweep appears at the end of this section):

```bash
for alpha in 0.75 0.80 0.85 0.90; do
    python smoothquant/ppl_eval.py \
        --model_path <model_path> \
        --act_scales_path <act_scales_path> \
        --alpha $alpha \
        --smooth \
        --quantize
done
```

### Issue: CUDA out of memory

**Solution**: Reduce the sample count and per-GPU batch size during calibration:

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --num-samples 256 \
    --per-gpu-batch-size 1
```

### Issue: torch-int not available

**Solution**: Install the required dependencies:

```bash
pip install torch-int
# If the build fails, ensure the CUDA toolkit matches your PyTorch installation
```
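As a programmatic alternative to the shell-based alpha sweep in the first troubleshooting item, here is a sketch using the `evaluate_ppl` helper documented in the API reference (assuming that signature; the model is copied per trial on the assumption that smoothing modifies it in place):

```python
import copy
from smoothquant.ppl_eval import evaluate_ppl  # signature per the API reference

best_alpha, best_ppl = None, float("inf")
for alpha in (0.75, 0.80, 0.85, 0.90):
    # Smoothing mutates the model, so evaluate a fresh copy each time
    trial = copy.deepcopy(model)
    ppl = evaluate_ppl(trial, "wikitext-2", scales, alpha=alpha,
                       smooth=True, quantize=True)
    print(f"alpha={alpha:.2f}  ppl={ppl:.3f}")
    if ppl < best_ppl:
        best_alpha, best_ppl = alpha, ppl
print(f"Best alpha: {best_alpha} (perplexity {best_ppl:.3f})")
```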
## Integration Examples

### With Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from smoothquant.smooth import smooth_lm

# Load the FP16 model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Apply smoothing with pre-computed activation scales
# (see examples/export_int8_model.py for the full implementation)
scales = torch.load("llama2-7b-scales.pt")
smooth_lm(model, scales, alpha=0.85)
```

### With vLLM

SmoothQuant-quantized models are compatible with vLLM for efficient serving:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model mit-han-lab/opt-30b-smoothquant \
    --quantization smoothquant
```

### With TensorRT-LLM

For production inference with maximum performance:

```bash
# Build with SmoothQuant support
trtllm-build \
    --checkpoint_dir ./quantized_model \
    --output_dir ./trt_engine \
    --quantization smoothquant
```

## References

- Paper: https://arxiv.org/abs/2211.10438
- GitHub: https://github.com/mit-han-lab/smoothquant
- Hugging Face: https://huggingface.co/mit-han-lab

---

# SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

**Source:** https://github.com/mit-han-lab/smoothquant

## Overview

SmoothQuant is a training-free, accuracy-preserving post-training quantization (PTQ) solution for large language models that quantizes both activations and weights to INT8 (W8A8). Developed by MIT and NVIDIA researchers, it enables efficient inference for LLMs with up to 175B parameters.

### Key Features

- **Training-free**: No fine-tuning required
- **Accuracy-preserving**: Negligible loss in model accuracy
- **W8A8 quantization**: Both weights and activations quantized to INT8
- **Hardware-efficient**: 1.56x speedup and 2x memory reduction
- **General-purpose**: Works with OPT, BLOOM, GLM, LLaMA, Falcon, Mistral, and Mixtral models

### Technical Innovation

SmoothQuant migrates quantization difficulty from activations to weights through a mathematically equivalent transformation, smoothing out activation outliers so that both weights and activations become easy to quantize.

### Integration Points

- NVIDIA TensorRT-LLM (INT8 W8A8 support)
- Amazon SageMaker (LLM inference optimization)
- Intel Neural Compressor (quantization toolkit)
- Microsoft ONNX Runtime (inference examples)
- AMD Instinct MI300X (INT8 GEMM via Composable Kernel)

## Performance Results

### Supported Models with W8A8 Quantization

| Model | Method | PPL | Alpha |
|-------|--------|-----|-------|
| Llama-2-7B | FP16 | 5.474 | - |
| | SQ W8A8 | 5.515 | 0.85 |
| Llama-2-13B | FP16 | 4.950 | - |
| | SQ W8A8 | 4.929 | 0.85 |
| Llama-2-70B | FP16 | 3.320 | - |
| | SQ W8A8 | 3.359 | 0.9 |
| Llama-3-8B | FP16 | 6.138 | - |
| | SQ W8A8 | 6.258 | 0.85 |
| Mistral-7B | FP16 | 5.253 | - |
| | SQ W8A8 | 5.277 | 0.8 |
| Mixtral-8x7B | FP16 | 3.842 | - |
| | SQ W8A8 | 3.893 | 0.8 |
| Falcon-7B | FP16 | 6.590 | - |
| | SQ W8A8 | 6.629 | 0.6 |
| Falcon-40B | FP16 | 5.228 | - |
| | SQ W8A8 | 5.255 | 0.7 |

## Installation

```bash
conda create -n smoothquant python=3.8
conda activate smoothquant
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers==4.36.0 accelerate datasets zstandard
python setup.py install
```

## Quick Start

### Load Pre-quantized INT8 Model

```python
from smoothquant.opt import Int8OPTForCausalLM

model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
```

Available pre-quantized models: `opt-125m`, `opt-1.3b`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-30b`, `opt-66b`

### Generate Activation Scales

```bash
python examples/generate_act_scales.py \
    --model-name <model_name> \
    --output-path <output_path> \
    --num-samples <num_samples> \
    --seq-len <seq_len> \
    --dataset-path <dataset_path>
```

### Smooth and Quantize Models

```bash
python examples/export_int8_model.py \
    --model_path <model_path> \
    --output_path <output_path> \
    --act_scales_path <act_scales_path>
```

### Evaluate Perplexity

```bash
python smoothquant/ppl_eval.py \
    --model_path <model_path> \
    --act_scales_path <act_scales_path> \
    --smooth \
    --alpha <alpha> \
    --quantize
```

## Core Components

### Main Modules

- **`smoothquant/smooth.py`**: Core smoothing algorithm implementation
- **`smoothquant/fake_quant.py`**: FP16 fake-quantization simulation
- **`smoothquant/opt.py`**: Quantized OPT model class with INT8 linear layers
- **`smoothquant/calibration.py`**: Calibration and scale computation
- **`smoothquant/ppl_eval.py`**: Perplexity evaluation for quantized models

A sketch of how these modules compose follows the examples list below.

### Examples

- **`examples/smoothquant_opt_demo.ipynb`**: OPT-13B W8A8 fake-quantization demo
- **`examples/smoothquant_opt_real_int8_demo.ipynb`**: OPT-30B real INT8 inference on A100
- **`examples/smoothquant_llama_demo.ipynb`**: LLaMA model quantization demo
- **`examples/generate_act_scales.py`**: Script to compute activation channel scales
- **`examples/export_int8_model.py`**: Export quantized INT8 models
- **`examples/ppl_eval.sh`**: Batch evaluation script for multiple models
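A sketch of the calibrate-smooth-evaluate pipeline built from the main modules. Function signatures follow the API reference above; `model` and `calib_loader` are assumed to be a loaded model and a calibration `DataLoader`:

```python
import torch
from smoothquant.calibration import compute_activation_scales
from smoothquant.smooth import smooth_lm
from smoothquant.ppl_eval import evaluate_ppl

# 1. Calibrate: collect per-channel activation statistics
scales = compute_activation_scales(model, calib_loader, num_samples=512)
torch.save(scales, "act_scales.pt")

# 2. Smooth: migrate quantization difficulty into the weights (in place)
smooth_lm(model, scales, alpha=0.85)

# 3. Evaluate with fake quantization (smooth=False: the model is already smoothed)
ppl = evaluate_ppl(model, "wikitext-2", scales, alpha=0.85,
                   smooth=False, quantize=True)
print(f"W8A8 perplexity: {ppl:.3f}")
```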
## Hardware Requirements

### For Real INT8 Inference

- NVIDIA GPU with INT8 support and the torch-int library
- CUTLASS INT8 GEMM kernels wrapped as PyTorch modules

### For Large Models (Multi-GPU)

- Recommended: use the FasterTransformer backend for distributed inference
- OPT-175B: 4 GPUs with INT8 vs. 8 GPUs with FP16

## Activation Channel Scales

Pre-computed activation scales are available for:

- OPT (all sizes)
- BLOOM
- Llama (1/2/3)
- Falcon
- Mistral
- Mixtral

Scales are computed with 512 random sentences from the Pile validation set.

## Research Paper

**SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models**

- **Authors**: Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, Song Han
- **Conference**: ICML 2023
- **Paper**: https://arxiv.org/abs/2211.10438
- **Slides**: see `assets/SmoothQuant.pdf` in the repository

## Citation

```bibtex
@InProceedings{xiao2023smoothquant,
    title = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},
    author = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},
    booktitle = {Proceedings of the 40th International Conference on Machine Learning},
    year = {2023}
}
```

## References

- **GitHub Repository**: https://github.com/mit-han-lab/smoothquant
- **Hugging Face Models**: https://huggingface.co/mit-han-lab/
- **Paper**: https://arxiv.org/abs/2211.10438
- **torch-int**: https://github.com/Guangxuan-Xiao/torch-int
- **CUTLASS**: https://github.com/NVIDIA/cutlass
- **FasterTransformer**: https://github.com/NVIDIA/FasterTransformer