# Installation Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

## Prerequisites

- Python 3.9 or higher
- CUDA 11.3+ (for GPU support)
- Conda (recommended)

## Installation Steps

### 1. Create a Conda Environment

```bash
conda create --name sqllm python=3.9 -y
conda activate sqllm
```

### 2. Clone and Install Dependencies

```bash
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```

This will install:

- torch
- transformers==4.29.0
- accelerate
- sentencepiece
- tokenizers>=0.12.1
- datasets

### 3. Verify Installation

You can verify the installation by importing the squeezellm module:

```python
import squeezellm
print("SqueezeLLM installed successfully!")
```

## Dependencies

The following dependencies are automatically installed:

- **torch**: Deep learning framework for GPU computation
- **transformers==4.29.0**: Hugging Face transformers library (specific version required)
- **accelerate**: Multi-GPU training and inference utilities
- **sentencepiece**: Tokenization library
- **tokenizers>=0.12.1**: Fast tokenizer implementation
- **datasets**: Dataset loading and processing

## GPU Setup

SqueezeLLM requires CUDA for efficient quantization and inference. Ensure your GPU drivers are installed:

```bash
# Check CUDA version
nvidia-smi

# Install CUDA toolkit if needed (if not already installed with torch)
conda install cuda-toolkit -c nvidia
```

## Troubleshooting

### Transformers Version Error

If you encounter errors related to the transformers version:

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed...
```

This is expected. The framework requires transformers==4.29.0 for compatibility. If needed, you can downgrade:

```bash
pip install transformers==4.29.0
```

### CUDA Compilation Issues

If you encounter CUDA compilation errors during `python setup_cuda.py install`:

1. Ensure the CUDA toolkit is properly installed
2. Verify your GPU supports the CUDA version
3. Check that g++ and nvcc are in your PATH

### Module Import Errors

If you get `ModuleNotFoundError` when importing squeezellm:

```bash
# Reinstall in development mode
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```
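### Quick Environment Check

As a quick sanity check after installation, the short script below prints the versions of the core dependencies and confirms that PyTorch can see a CUDA device. It only uses standard `torch`/`transformers` attributes and repeats the `import squeezellm` check from the verification step above.

```python
# Sanity check: print core dependency versions and confirm CUDA visibility.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)  # expected: 4.29.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The squeezellm package itself, installed via `pip install -e .`
import squeezellm
print("squeezellm import OK")
```

If the CUDA check prints `False`, revisit the GPU Setup section before attempting quantized inference.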
---

# SqueezeLLM Model Zoo

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

All pre-quantized models are available from the Squeeze AI Lab on Hugging Face Hub: https://huggingface.co/squeeze-ai-lab

## Model Naming Convention

- `sq-{base-model}-{size}-w{bits}-s{sparsity}`: Standard naming format
- `sq-llama-7b-w3-s0`: LLaMA-7B, 3-bit, 0% sparsity (dense-only)
- `sq-llama-7b-w4-s45`: LLaMA-7B, 4-bit, 0.45% sparsity

## LLaMA (v1)

Supported sizes: 7B, 13B, 30B, 65B

| Model | 3-bit (Dense) | 3-bit (0.05% S) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.05% S) | 4-bit (0.45% S) |
|-------|---|---|---|---|---|---|
| **LLaMA-7B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-13B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-30B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-65B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

**Note**: LLaMA v1 requires base model checkpoint

## LLaMA-2

Supported sizes: 7B, 13B

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **LLaMA-2-7B** | ✓ | ✓ |
| **LLaMA-2-13B** | ✓ | ✓ |

**Note**: Includes Hugging Face compatible configs

## Mistral

Supported models: Mistral-7B (base and instruct)

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **Mistral-7B** | ✓ | ✓ |
| **Mistral-7B-Instruct** | ✓ | ✓ |

**Added**: November 2024

## Vicuna (v1.1)

Supported sizes: 7B, 13B

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **Vicuna-7B** | ✓ | ✓ | ✓ | ✓ |
| **Vicuna-13B** | ✓ | ✓ | ✓ | ✓ |

## Vicuna (v1.3)

Supported sizes: 7B, 13B, 30B (30B coming soon)

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **Vicuna-7B-v1.3** | ✓ | ✓ |
| **Vicuna-13B-v1.3** | ✓ | ✓ |
| **Vicuna-30B-v1.3** | Coming | Coming |

See [FastChat docs](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md) for v1.1 vs v1.3 differences.

## XGen (8K Sequence Length)

Based on [Salesforce XGen-7B](https://blog.salesforceairesearch.com/xgen/)

Models: XGen-7B-8k-Base, XGen-7B-8k-Inst

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **XGen-7B-8k-Base** | ✓ | ✓ | ✓ | ✓ |
| **XGen-7B-8k-Inst** | ✓ | ✓ | ✓ | ✓ |

**Key Feature**: 8K context length support

## OPT

Supported sizes: 1.3B, 2.7B, 6.7B, 13B, 30B

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **OPT-1.3B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-2.7B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-6.7B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-13B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-30B** | ✓ | ✓ | ✓ | ✓ |

## Download Instructions

### Download from Hugging Face Hub

```python
from huggingface_hub import hf_hub_download

# Example: Download LLaMA-7B 3-bit dense-only model
model_name = "squeeze-ai-lab/sq-llama-7b-w3-s0"
checkpoint = hf_hub_download(repo_id=model_name, filename="sq-llama-7b-w3-s0.pt")
print(f"Downloaded to: {checkpoint}")
```

Or via wget:

```bash
wget https://huggingface.co/squeeze-ai-lab/sq-llama-7b-w3-s0/resolve/main/sq-llama-7b-w3-s0.pt
```

### Manual Download

Browse and download directly from: https://huggingface.co/squeeze-ai-lab
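### Listing Repository Files (Optional)

If you are unsure which checkpoint file name a given repository uses, you can list its contents before downloading. This is a minimal sketch using the public `huggingface_hub` API (`list_repo_files` and `hf_hub_download`); the repo id shown is simply the example from the naming convention above.

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "squeeze-ai-lab/sq-llama-7b-w3-s0"  # example repo from the naming convention above

# Inspect the repository to find the checkpoint file name(s)
files = list_repo_files(repo_id)
print(files)

# Download every .pt checkpoint found in the repo
for filename in files:
    if filename.endswith(".pt"):
        path = hf_hub_download(repo_id=repo_id, filename=filename)
        print("Downloaded:", path)
```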
## Loading and Using Models

See [Quick Start Guide](quickstart.md) for inference examples.

For model benchmarking and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval
```

## Model Performance

Approximate performance metrics (from paper):

### 3-bit Dense-Only Quantization

- LLaMA-7B: ~2-3% accuracy drop
- Vicuna-7B: ~1-2% accuracy drop
- Memory savings: ~75%

### 4-bit Dense-Only Quantization

- LLaMA-7B: <1% accuracy drop
- Vicuna-7B: <0.5% accuracy drop
- Memory savings: ~62%

### 4-bit Dense-and-Sparse (0.45% + 0.05%)

- LLaMA-7B: <0.5% accuracy drop (often better!)
- Vicuna-7B: Minimal accuracy drop
- Memory savings: ~60% with minimal accuracy loss

## Integration with vLLM

SqueezeLLM models are officially supported in [vLLM](https://github.com/vllm-project/vllm) for efficient serving.
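As a rough illustration of what serving through vLLM can look like, here is a minimal sketch. It assumes a vLLM version that accepts `quantization="squeezellm"` and a Hub repository that ships Hugging Face-compatible configs (see the LLaMA-2 note above); the repo id shown is an assumption, so check https://huggingface.co/squeeze-ai-lab for exact names and the vLLM documentation for current support status.

```python
# Minimal vLLM serving sketch (assumptions: a vLLM build that supports the
# "squeezellm" quantization option, and a repo with HF-compatible configs).
from vllm import LLM, SamplingParams

llm = LLM(
    model="squeeze-ai-lab/sq-llama-2-7b-w4-s0",  # hypothetical repo id; verify on the Hub
    quantization="squeezellm",
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["SqueezeLLM makes LLM serving"], params)
print(outputs[0].outputs[0].text)
```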
## Citation

If you use SqueezeLLM models, please cite:

```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```

---

# SqueezeLLM Documentation

## Overview

SqueezeLLM is a post-training quantization framework that uses Dense-and-Sparse Quantization to enable efficient LLM serving.

**Paper:** [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629) (ICML 2024)

**GitHub Repository:** https://github.com/SqueezeAILab/SqueezeLLM

**License:** MIT

## Key Features

- **Dense-and-Sparse Quantization**: Splits weight matrices into:
  - Dense component: heavily quantized without affecting model performance
  - Sparse part: preserves sensitive and outlier parts of weight matrices
- **Non-uniform Quantization**: Uses sensitivity-based compression to maintain accuracy
- **Efficient LLM Serving**: Serve larger models with smaller memory footprints
- **3-bit and 4-bit Precision**: Support for multiple quantization levels
- **Variable Sparsity**: Dense-only (0%), 0.05%, and 0.45% sparsity levels

## Supported Models

### LLaMA (v1)

- LLaMA-7B, 13B, 30B, 65B
- 3-bit and 4-bit quantization
- Multiple sparsity levels (0%, 0.05%, 0.45%)

### LLaMA-2

- LLaMA-2-7B, 13B
- 3-bit and 4-bit quantization

### Mistral

- Mistral-7B (base and instruct variants)
- 3-bit and 4-bit quantization

### Vicuna

- Vicuna-7B, 13B (v1.1, v1.3)
- 3-bit and 4-bit quantization
- Multiple sparsity levels

### XGen

- XGen-7B-8k-Base, XGen-7B-8k-Inst
- 3-bit and 4-bit quantization
- 8K sequence length support

### OPT

- OPT-1.3B, 2.7B, 6.7B, 13B, 30B
- 3-bit and 4-bit quantization
- Multiple sparsity levels

## Performance Highlights

- Vicuna-7B models can be served in 6 GB of memory
- Achieve 2% higher MMLU accuracy than baseline FP16 models
- Maintain the same latency as full-precision models
- Support for integration with the vLLM framework

## Documentation Index

1. **Installation Guide** - Setup and environment configuration
2. **Usage & Benchmarking** - Running inference and performance evaluation
3. **From-Scratch Quantization** - Quantizing custom models
4. **Model Zoo** - Available pre-quantized models
5. **Integration** - Using SqueezeLLM with vLLM

## Citation

```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```

---

# From-Scratch Quantization Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/quantization/README.md

This guide covers how to quantize custom models using SqueezeLLM from scratch.

## Overview

SqueezeLLM quantization involves five main steps:

1. **Compute gradients** (Fisher-based sensitivity scores)
2. **Chunk model weights and gradients** (layer granularity)
3. **Generate outlier configuration** (optional, for Dense-and-Sparse)
4. **K-means clustering** (generate non-uniform quantization LUT)
5. **Packing** (save quantized model)

## Prerequisites

### Base Requirements

In addition to the SqueezeLLM installation dependencies:

```bash
pip install scikit-learn==1.3.1
```

### Model Checkpoint

You must have your own LLaMA Hugging Face checkpoint saved locally at `[MODEL_PATH]`.

## Step 1: Compute Gradients (Fisher-based Sensitivity Score)

SqueezeLLM employs the **Fisher Information matrix** as a sensitivity metric to identify which weights are most critical to model performance.

### Using the Separate Framework

Use the dedicated gradient computation framework: https://github.com/kssteven418/SqueezeLLM-gradients

This framework will:

- Compute gradient squares for your target model
- Output in the same format as the original Hugging Face checkpoint
- Replace weight values with gradient square values

### Running Gradient Computation

Follow the instructions in the SqueezeLLM-gradients repository:

```bash
git clone https://github.com/kssteven418/SqueezeLLM-gradients
cd SqueezeLLM-gradients
# Follow README for gradient computation
```

This produces: `[GRADIENT_PATH]` - gradient checkpoint with Fisher scores
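### Conceptual Sketch: Squared Gradients as Sensitivity

To make the sensitivity metric concrete, the sketch below shows the basic idea behind Fisher-based scores: accumulate squared gradients of the loss with respect to each weight over a few calibration batches. This is only a conceptual illustration in plain PyTorch, with placeholder paths and calibration data; it is not the SqueezeLLM-gradients implementation, which should be used to produce the actual `[GRADIENT_PATH]` checkpoint.

```python
# Conceptual sketch of Fisher-based sensitivity: average squared gradients
# of the loss w.r.t. each weight over calibration batches.
# Placeholder model/data; the real pipeline lives in SqueezeLLM-gradients.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-7b"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

calibration_texts = ["example calibration sentence"]  # placeholder data
fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

for text in calibration_texts:
    batch = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            fisher[n] += p.grad.detach() ** 2  # gradient square = sensitivity

# Normalize by the number of batches; higher values = more sensitive weights.
fisher = {n: g / len(calibration_texts) for n, g in fisher.items()}
```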
## Step 2: Chunk Model Weights and Gradients

Both the model and gradient checkpoints must be chunked at layer granularity to reduce memory overhead during loading.

### Chunk Model Weights

```bash
python chunk_models.py --model [MODEL_PATH] --output [MODEL_CHUNKS_PATH] --model_type llama
```

### Chunk Gradient Checkpoint

```bash
python chunk_models.py --model [GRADIENT_PATH] --output [GRADIENT_CHUNKS_PATH] --model_type llama
```

### Output

This produces:

- `[MODEL_CHUNKS_PATH]`: Chunked model weights at layer granularity
- `[GRADIENT_CHUNKS_PATH]`: Chunked gradients at layer granularity

These chunked formats reduce loading overhead significantly.

## Step 3: Outlier Configuration Generation (Optional)

This step is **optional** and only needed for **Dense-and-Sparse (D+S) quantization**.

### Purpose

Generates a configuration file defining thresholds for identifying outlier values in weights.

### Run Outlier Configuration

```bash
python generate_outlier_config.py --model [MODEL_CHUNKS_PATH] --range [RANGE] --output [OUTLIERS_CONFIG_PATH]
```

### Arguments

- `--model`: Path to chunked model weights from Step 2
- `--range`: Threshold multiplier for `T_min` and `T_max` (see paper Section 4.2)
  - Larger values = fewer outliers
  - Recommended starting range: **1.5-2.0**
- `--output`: Output directory (saves as `[OUTLIERS_CONFIG_PATH]/outlier_config_o{percentage}.json`)

### Output

Configuration file: `[OUTLIERS_CONFIG_PATH]/outlier_config_o0.45.json` (example for 0.45% outliers)

You will need to fine-tune `--range` to achieve the desired outlier percentage.

## Step 4: K-means Clustering (Non-uniform Quantization LUT)

Performs K-means clustering to generate the non-uniform quantization look-up table (LUT).

### Dense-Only Quantization

```bash
python nuq.py --bit 4 --model_type llama --model [MODEL_CHUNKS_PATH] --gradient [GRADIENT_CHUNKS_PATH] --output [LUT_PATH]
```

### Dense-and-Sparse Quantization

If using D+S quantization with 0.45% outliers and 0.05% sensitive values:

```bash
python nuq.py --bit 4 --model_type llama --model [MODEL_CHUNKS_PATH] --gradient [GRADIENT_CHUNKS_PATH] --output [LUT_PATH] --outlier_config [OUTLIERS_CONFIG_PATH]/outlier_config_o0.45.json --sensitivity 0.05
```

### Arguments

- `--bit`: Quantization bitwidth (3 or 4)
- `--model`: Path to chunked model weights
- `--gradient`: Path to chunked gradients
- `--output`: Output directory for the LUT
- `--range`: (Optional) Quantize a specific layer range, e.g., `0,10` for layers 0-9
- `--outlier_config`: (D+S only) Path to the outlier config from Step 3
- `--sensitivity`: (D+S only) Percentage of sensitive values to extract (e.g., `0.05` for 0.05%)

### Performance Note

This step is **highly CPU-intensive**. Run it on a machine with:

- Many CPU cores
- Strong per-core CPU performance
- Sufficient RAM for model loading

### Output

LUT entries saved in: `[LUT_PATH]/lut`
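### Conceptual Sketch: Sensitivity-Weighted K-means

The core of this step is a weighted k-means: a layer's weight values are clustered into 2^bits centroids, with each value weighted by its Fisher sensitivity so that sensitive weights pull the centroids toward themselves. The sketch below illustrates that idea with scikit-learn (already a prerequisite above) on a single placeholder weight vector; `nuq.py` should be used for the actual LUT generation over the chunked checkpoints.

```python
# Conceptual sketch of sensitivity-weighted k-means for one weight vector:
# cluster weight values into 2**bits centroids, weighting each value by its
# Fisher score. nuq.py performs this over the chunked model for the real LUT.
import numpy as np
from sklearn.cluster import KMeans

bits = 4
rng = np.random.default_rng(0)
weights = rng.normal(size=(4096,)).astype(np.float32)   # placeholder layer weights
fisher = rng.random(size=(4096,)).astype(np.float32)    # placeholder sensitivity scores

kmeans = KMeans(n_clusters=2**bits, n_init=10, random_state=0)
kmeans.fit(weights.reshape(-1, 1), sample_weight=fisher)

lut = np.sort(kmeans.cluster_centers_.flatten())        # 16 non-uniform levels
codes = kmeans.predict(weights.reshape(-1, 1))          # per-weight 4-bit indices
dequantized = kmeans.cluster_centers_.flatten()[codes]  # reconstruction from the LUT

print("LUT:", lut)
```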
## Step 5: Packing

Saves the quantized model in packed format using the LUT from Step 4.

### Dense-Only Packing

```bash
python pack.py --model [MODEL_PATH] --wbits 4 --folder [LUT_PATH] --save [PACKED_CKPT_PATH]
```

### Dense-and-Sparse Packing

For D+S quantization (with sparse components):

```bash
python pack.py --model [MODEL_PATH] --wbits 4 --folder [LUT_PATH] --save [PACKED_CKPT_PATH] --include_sparse --balance
```

### Arguments

- `--model`: Original model checkpoint path
- `--wbits`: Quantization bitwidth (should match Step 4)
- `--folder`: Path to the LUT directory from Step 4
- `--save`: Output path for the packed checkpoint
- `--include_sparse`: (D+S only) Include sparse components
- `--balance`: (D+S only) Balance sparse weight distribution

### Output

Packed checkpoint: `[PACKED_CKPT_PATH]` - ready for immediate use in inference!

## Complete Example Workflow

### Dense-Only Quantization (3-bit)

```bash
# 1. Prepare gradient checkpoint using SqueezeLLM-gradients
# (produces [GRADIENT_PATH])

# 2. Chunk models
python chunk_models.py --model /path/to/llama-7b --output ./llama7b_chunks --model_type llama
python chunk_models.py --model /path/to/gradients --output ./llama7b_grad_chunks --model_type llama

# 3. K-means clustering
python nuq.py --bit 3 --model_type llama --model ./llama7b_chunks --gradient ./llama7b_grad_chunks --output ./llama7b_lut

# 4. Pack model
python pack.py --model /path/to/llama-7b --wbits 3 --folder ./llama7b_lut --save ./sq-llama-7b-w3-s0.pt
```

### Dense-and-Sparse Quantization (4-bit, 0.45% outliers, 0.05% sensitive)

```bash
# 1-2. Same as above...

# 3. Generate outlier config
python generate_outlier_config.py --model ./llama7b_chunks --range 1.8 --output ./llama7b_outliers

# 4. K-means clustering with outliers
python nuq.py --bit 4 --model_type llama --model ./llama7b_chunks --gradient ./llama7b_grad_chunks --output ./llama7b_lut_ds --outlier_config ./llama7b_outliers/outlier_config_o0.45.json --sensitivity 0.05

# 5. Pack with sparse components
python pack.py --model /path/to/llama-7b --wbits 4 --folder ./llama7b_lut_ds --save ./sq-llama-7b-w4-s45.pt --include_sparse --balance
```

## Supported Model Types

- `llama`: LLaMA (all versions)
- `llama2`: LLaMA-2
- `mistral`: Mistral
- `vicuna`: Vicuna
- `xgen`: XGen
- `opt`: OPT

## Key Concepts

### Fisher Information Matrix

Measures the sensitivity of model outputs to changes in weights. Weights with high Fisher scores are more critical to model performance.

### Non-Uniform Quantization

Instead of uniform quantization (fixed step sizes), SqueezeLLM uses K-means to find optimal per-layer quantization levels based on Fisher sensitivity.

### Dense-and-Sparse Quantization

Splits weights into:

- **Dense**: Aggressively quantized (3-4 bits)
- **Sparse**: Full precision, containing outliers and sensitive values (~0.45% + 0.05%)

This achieves high compression while maintaining accuracy.
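To visualize the dense/sparse split, the sketch below pulls the largest-magnitude entries of a weight matrix out into a sparse tensor and leaves the rest as the dense component to be quantized. The weight matrix and the 0.45% cutoff are placeholders for illustration only; the packed checkpoints produced by `pack.py --include_sparse` store this decomposition in their own format.

```python
# Conceptual sketch of the dense-and-sparse split: keep the top ~0.45% of
# weights by magnitude in full precision (sparse), quantize the rest (dense).
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)     # placeholder weight matrix
outlier_frac = 0.0045           # ~0.45% outliers

k = int(outlier_frac * W.numel())
threshold = W.abs().flatten().kthvalue(W.numel() - k).values  # magnitude cutoff
mask = W.abs() > threshold

sparse_part = (W * mask).to_sparse()  # full-precision outliers
dense_part = W * (~mask)              # remainder, to be quantized via the LUT

print("outliers kept:", mask.sum().item(), "of", W.numel())
# At inference time, W is approximated as dequantize(dense_part) + sparse_part.
```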
## Performance Expectations

- **3-bit Dense-only**: ~1-2% accuracy drop on benchmarks
- **4-bit Dense-only**: <1% accuracy drop
- **4-bit D+S (0.45% + 0.05%)**: ~0% accuracy drop vs. FP16

See the [research paper](https://arxiv.org/abs/2306.07629) for detailed results.

---

# Quick Start Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

## Download Pre-Quantized Models

SqueezeLLM provides pre-quantized models on Hugging Face Hub. Models are available for:

- **Bitwidth**: 3-bit and 4-bit
- **Sparsity**: 0% (dense-only), 0.05%, and 0.45%
- **Model families**: LLaMA, LLaMA-2, Mistral, Vicuna, XGen, OPT

All models are available at: https://huggingface.co/squeeze-ai-lab

Example model naming scheme:

- `sq-llama-7b-w3-s0`: LLaMA-7B, 3-bit, dense-only
- `sq-llama-7b-w4-s45`: LLaMA-7B, 4-bit, 0.45% sparsity

## Running Inference

### Basic Inference Example

```bash
# Download quantized model
# Example: sq-llama-7b-w3-s0.pt

# Set up environment
export CUDA_VISIBLE_DEVICES=0
export MODEL_PATH=/path/to/base/model  # Required for LLaMA v1 and Vicuna v1.1
export CKPT_PATH=/path/to/sq-llama-7b-w3-s0.pt

# Run inference
python llama.py $MODEL_PATH c4 --wbits 3 --load $CKPT_PATH --benchmark 128
```

### Benchmarking

#### LLaMA Benchmarking

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check --torch_profile
```

For models with sparsity:

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --benchmark 128 --check --torch_profile
```

#### XGen Benchmarking

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --benchmark 128 --check --torch_profile
```

### Perplexity Evaluation

Evaluate model perplexity on the C4 dataset:

```bash
# Dense-only models
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval

# Sparse models
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --eval
```

## Command-Line Options

### Common Arguments

- `--wbits`: Quantization bitwidth (3 or 4)
- `--load`: Path to quantized checkpoint
- `--benchmark`: Run benchmarking with the specified sequence length
- `--eval`: Evaluate perplexity
- `--check`: Verify quantized model outputs
- `--torch_profile`: Enable PyTorch profiling for runtime analysis
- `--include_sparse`: Include sparse components (for D+S quantized models)

## GPU Requirements

### Minimum Requirements

- 8GB VRAM for 7B models
- 16GB VRAM for 13B models

### Recommended Setup

- A5000 or A6000 GPU
- CUDA 11.3+
- cuDNN 8.2+

## Integration with vLLM

SqueezeLLM is supported in the official vLLM framework for efficient serving: https://github.com/vllm-project/vllm

See the vLLM documentation for integration details.

## Next Steps

- For custom model quantization, see the [From-Scratch Quantization Guide](quantization-guide.md)
- For detailed model configurations, see the [Model Zoo](model-zoo.md)
- For implementation details, see the [Research Paper](https://arxiv.org/abs/2306.07629)