# Installation Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

## Prerequisites

- Python 3.9 or higher
- CUDA 11.3+ (for GPU support)
- Conda (recommended)

## Installation Steps

### 1. Create a Conda Environment

```bash
conda create --name sqllm python=3.9 -y
conda activate sqllm
```

### 2. Clone and Install Dependencies

```bash
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```

This will install:

- torch
- transformers==4.29.0
- accelerate
- sentencepiece
- tokenizers>=0.12.1
- datasets

### 3. Verify Installation

You can verify the installation by importing the squeezellm module:

```python
import squeezellm
print("SqueezeLLM installed successfully!")
```

## Dependencies

The following dependencies are automatically installed:

- **torch**: Deep learning framework for GPU computation
- **transformers==4.29.0**: Hugging Face transformers library (specific version required)
- **accelerate**: Multi-GPU training and inference utilities
- **sentencepiece**: Tokenization library
- **tokenizers>=0.12.1**: Fast tokenizer implementation
- **datasets**: Dataset loading and processing

## GPU Setup

SqueezeLLM requires CUDA for efficient quantization and inference. Ensure your GPU drivers are installed:

```bash
# Check CUDA version
nvidia-smi

# Install CUDA toolkit if needed (if not already installed with torch)
conda install cuda-toolkit -c nvidia
```

## Troubleshooting

### Transformers Version Error

If you encounter errors related to the transformers version:

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed...
```

This is expected. The framework requires transformers==4.29.0 for compatibility. If needed, you can downgrade:

```bash
pip install transformers==4.29.0
```

### CUDA Compilation Issues

If you encounter CUDA compilation errors during `python setup_cuda.py install`:

1. Ensure the CUDA toolkit is properly installed
2. Verify your GPU supports the CUDA version
3. Check that g++ and nvcc are in your PATH

### Module Import Errors

If you get `ModuleNotFoundError` when importing squeezellm:

```bash
# Reinstall in development mode
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```
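### Quick Environment Check

As a quick sanity check after installation, the short script below prints the versions of the core dependencies and confirms that PyTorch can see a CUDA device. It only uses standard `torch`/`transformers` attributes and repeats the `import squeezellm` check from the verification step above.

```python
# Sanity check: print core dependency versions and confirm CUDA visibility.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)  # expected: 4.29.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The squeezellm package itself, installed via `pip install -e .`
import squeezellm
print("squeezellm import OK")
```

If the CUDA check prints `False`, revisit the GPU Setup section before attempting quantized inference.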
---

# SqueezeLLM Model Zoo

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

All pre-quantized models are available from the Squeeze AI Lab on Hugging Face Hub: https://huggingface.co/squeeze-ai-lab

## Model Naming Convention

- `sq-{base-model}-{size}-w{bits}-s{sparsity}`: Standard naming format
- `sq-llama-7b-w3-s0`: LLaMA-7B, 3-bit, 0% sparsity (dense-only)
- `sq-llama-7b-w4-s45`: LLaMA-7B, 4-bit, 0.45% sparsity

## LLaMA (v1)

Supported sizes: 7B, 13B, 30B, 65B

| Model | 3-bit (Dense) | 3-bit (0.05% S) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.05% S) | 4-bit (0.45% S) |
|-------|---|---|---|---|---|---|
| **LLaMA-7B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-13B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-30B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **LLaMA-65B** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

**Note**: LLaMA v1 requires base model checkpoint

## LLaMA-2

Supported sizes: 7B, 13B

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **LLaMA-2-7B** | ✓ | ✓ |
| **LLaMA-2-13B** | ✓ | ✓ |

**Note**: Includes Hugging Face compatible configs

## Mistral

Supported models: Mistral-7B (base and instruct)

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **Mistral-7B** | ✓ | ✓ |
| **Mistral-7B-Instruct** | ✓ | ✓ |

**Added**: November 2024

## Vicuna (v1.1)

Supported sizes: 7B, 13B

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **Vicuna-7B** | ✓ | ✓ | ✓ | ✓ |
| **Vicuna-13B** | ✓ | ✓ | ✓ | ✓ |

## Vicuna (v1.3)

Supported sizes: 7B, 13B, 30B (30B coming soon)

| Model | 3-bit (Dense) | 4-bit (Dense) |
|-------|---|---|
| **Vicuna-7B-v1.3** | ✓ | ✓ |
| **Vicuna-13B-v1.3** | ✓ | ✓ |
| **Vicuna-30B-v1.3** | Coming | Coming |

See [FastChat docs](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md) for v1.1 vs v1.3 differences.

## XGen (8K Sequence Length)

Based on [Salesforce XGen-7B](https://blog.salesforceairesearch.com/xgen/)

Models: XGen-7B-8k-Base, XGen-7B-8k-Inst

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **XGen-7B-8k-Base** | ✓ | ✓ | ✓ | ✓ |
| **XGen-7B-8k-Inst** | ✓ | ✓ | ✓ | ✓ |

**Key Feature**: 8K context length support

## OPT

Supported sizes: 1.3B, 2.7B, 6.7B, 13B, 30B

| Model | 3-bit (Dense) | 3-bit (0.45% S) | 4-bit (Dense) | 4-bit (0.45% S) |
|-------|---|---|---|---|
| **OPT-1.3B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-2.7B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-6.7B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-13B** | ✓ | ✓ | ✓ | ✓ |
| **OPT-30B** | ✓ | ✓ | ✓ | ✓ |

## Download Instructions

### Download from Hugging Face Hub

```python
from huggingface_hub import hf_hub_download

# Example: Download LLaMA-7B 3-bit dense-only model
model_name = "squeeze-ai-lab/sq-llama-7b-w3-s0"
checkpoint = hf_hub_download(repo_id=model_name, filename="sq-llama-7b-w3-s0.pt")
print(f"Downloaded to: {checkpoint}")
```

Or via wget:

```bash
wget https://huggingface.co/squeeze-ai-lab/sq-llama-7b-w3-s0/resolve/main/sq-llama-7b-w3-s0.pt
```

### Manual Download

Browse and download directly from: https://huggingface.co/squeeze-ai-lab
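### Listing Repository Files (Optional)

If you are unsure which checkpoint file name a given repository uses, you can list its contents before downloading. This is a minimal sketch using the public `huggingface_hub` API (`list_repo_files` and `hf_hub_download`); the repo id shown is simply the example from the naming convention above.

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "squeeze-ai-lab/sq-llama-7b-w3-s0"  # example repo from the naming convention above

# Inspect the repository to find the checkpoint file name(s)
files = list_repo_files(repo_id)
print(files)

# Download every .pt checkpoint found in the repo
for filename in files:
    if filename.endswith(".pt"):
        path = hf_hub_download(repo_id=repo_id, filename=filename)
        print("Downloaded:", path)
```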
## Loading and Using Models

See [Quick Start Guide](quickstart.md) for inference examples.

For model benchmarking and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval
```

## Model Performance

Approximate performance metrics (from paper):

### 3-bit Dense-Only Quantization

- LLaMA-7B: ~2-3% accuracy drop
- Vicuna-7B: ~1-2% accuracy drop
- Memory savings: ~75%

### 4-bit Dense-Only Quantization

- LLaMA-7B: <1% accuracy drop
- Vicuna-7B: <0.5% accuracy drop
- Memory savings: ~62%

### 4-bit Dense-and-Sparse (0.45% + 0.05%)

- LLaMA-7B: <0.5% accuracy drop (often better!)
- Vicuna-7B: Minimal accuracy drop
- Memory savings: ~60% with minimal accuracy loss

## Integration with vLLM

SqueezeLLM models are officially supported in [vLLM](https://github.com/vllm-project/vllm) for efficient serving.
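As a rough illustration of what serving through vLLM can look like, here is a minimal sketch. It assumes a vLLM version that accepts `quantization="squeezellm"` and a Hub repository that ships Hugging Face-compatible configs (see the LLaMA-2 note above); the repo id shown is an assumption, so check https://huggingface.co/squeeze-ai-lab for exact names and the vLLM documentation for current support status.

```python
# Minimal vLLM serving sketch (assumptions: a vLLM build that supports the
# "squeezellm" quantization option, and a repo with HF-compatible configs).
from vllm import LLM, SamplingParams

llm = LLM(
    model="squeeze-ai-lab/sq-llama-2-7b-w4-s0",  # hypothetical repo id; verify on the Hub
    quantization="squeezellm",
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["SqueezeLLM makes LLM serving"], params)
print(outputs[0].outputs[0].text)
```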
## Citation

If you use SqueezeLLM models, please cite:

```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```

---

# SqueezeLLM Documentation

## Overview

SqueezeLLM is a post-training quantization framework that uses Dense-and-Sparse Quantization to enable efficient LLM serving.

**Paper:** [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629) (ICML 2024)

**GitHub Repository:** https://github.com/SqueezeAILab/SqueezeLLM

**License:** MIT

## Key Features

- **Dense-and-Sparse Quantization**: Splits weight matrices into:
  - Dense component: heavily quantized without affecting model performance
  - Sparse part: preserves sensitive and outlier parts of weight matrices
- **Non-uniform Quantization**: Uses sensitivity-based compression to maintain accuracy
- **Efficient LLM Serving**: Serve larger models with smaller memory footprints
- **3-bit and 4-bit Precision**: Support for multiple quantization levels
- **Variable Sparsity**: Dense-only (0%), 0.05%, and 0.45% sparsity levels

## Supported Models

### LLaMA (v1)

- LLaMA-7B, 13B, 30B, 65B
- 3-bit and 4-bit quantization
- Multiple sparsity levels (0%, 0.05%, 0.45%)

### LLaMA-2

- LLaMA-2-7B, 13B
- 3-bit and 4-bit quantization

### Mistral

- Mistral-7B (base and instruct variants)
- 3-bit and 4-bit quantization

### Vicuna

- Vicuna-7B, 13B (v1.1, v1.3)
- 3-bit and 4-bit quantization
- Multiple sparsity levels

### XGen

- XGen-7B-8k-Base, XGen-7B-8k-Inst
- 3-bit and 4-bit quantization
- 8K sequence length support

### OPT

- OPT-1.3B, 2.7B, 6.7B, 13B, 30B
- 3-bit and 4-bit quantization
- Multiple sparsity levels

## Performance Highlights

- Vicuna-7B models can be served in 6 GB of memory
- Achieve 2% higher MMLU accuracy than baseline FP16 models
- Maintain the same latency as full-precision models
- Support for integration with the vLLM framework

## Documentation Index

1. **Installation Guide** - Setup and environment configuration
2. **Usage & Benchmarking** - Running inference and performance evaluation
3. **From-Scratch Quantization** - Quantizing custom models
4. **Model Zoo** - Available pre-quantized models
5. **Integration** - Using SqueezeLLM with vLLM

## Citation

```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```

---

# From-Scratch Quantization Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/quantization/README.md

This guide covers how to quantize custom models using SqueezeLLM from scratch.

## Overview

SqueezeLLM quantization involves five main steps:

1. **Compute gradients** (Fisher-based sensitivity scores)
2. **Chunk model weights and gradients** (layer granularity)
3. **Generate outlier configuration** (optional, for Dense-and-Sparse)
4. **K-means clustering** (generate non-uniform quantization LUT)
5. **Packing** (save quantized model)

## Prerequisites

### Base Requirements

In addition to the SqueezeLLM installation dependencies:

```bash
pip install scikit-learn==1.3.1
```

### Model Checkpoint

You must have your own LLaMA Hugging Face checkpoint saved locally at `[MODEL_PATH]`.

## Step 1: Compute Gradients (Fisher-based Sensitivity Score)

SqueezeLLM employs the **Fisher Information matrix** as a sensitivity metric to identify which weights are most critical to model performance.

### Using the Separate Framework

Use the dedicated gradient computation framework: https://github.com/kssteven418/SqueezeLLM-gradients

This framework will:

- Compute gradient squares for your target model
- Output in the same format as the original Hugging Face checkpoint
- Replace weight values with gradient square values

### Running Gradient Computation

Follow the instructions in the SqueezeLLM-gradients repository:

```bash
git clone https://github.com/kssteven418/SqueezeLLM-gradients
cd SqueezeLLM-gradients
# Follow README for gradient computation
```

This produces: `[GRADIENT_PATH]` - gradient checkpoint with Fisher scores
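### Conceptual Sketch: Squared Gradients as Sensitivity

To make the sensitivity metric concrete, the sketch below shows the basic idea behind Fisher-based scores: accumulate squared gradients of the loss with respect to each weight over a few calibration batches. This is only a conceptual illustration in plain PyTorch, with placeholder paths and calibration data; it is not the SqueezeLLM-gradients implementation, which should be used to produce the actual `[GRADIENT_PATH]` checkpoint.

```python
# Conceptual sketch of Fisher-based sensitivity: average squared gradients
# of the loss w.r.t. each weight over calibration batches.
# Placeholder model/data; the real pipeline lives in SqueezeLLM-gradients.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-7b"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

calibration_texts = ["example calibration sentence"]  # placeholder data
fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

for text in calibration_texts:
    batch = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            fisher[n] += p.grad.detach() ** 2  # gradient square = sensitivity

# Normalize by the number of batches; higher values = more sensitive weights.
fisher = {n: g / len(calibration_texts) for n, g in fisher.items()}
```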
## Step 2: Chunk Model Weights and Gradients

Both the model and gradient checkpoints must be chunked at layer granularity to reduce memory overhead during loading.

### Chunk Model Weights

```bash
python chunk_models.py --model [MODEL_PATH] --output [MODEL_CHUNKS_PATH] --model_type llama
```

### Chunk Gradient Checkpoint

```bash
python chunk_models.py --model [GRADIENT_PATH] --output [GRADIENT_CHUNKS_PATH] --model_type llama
```

### Output

This produces:

- `[MODEL_CHUNKS_PATH]`: Chunked model weights at layer granularity
- `[GRADIENT_CHUNKS_PATH]`: Chunked gradients at layer granularity

These chunked formats reduce loading overhead significantly.

## Step 3: Outlier Configuration Generation (Optional)

This step is **optional** and only needed for **Dense-and-Sparse (D+S) quantization**.

### Purpose

Generates a configuration file defining thresholds for identifying outlier values in weights.

### Run Outlier Configuration

```bash
python generate_outlier_config.py --model [MODEL_CHUNKS_PATH] --range [RANGE] --output [OUTLIERS_CONFIG_PATH]
```

### Arguments

- `--model`: Path to chunked model weights from Step 2
- `--range`: Threshold multiplier for `T_min` and `T_max` (see paper Section 4.2)
  - Larger values = fewer outliers
  - Recommended starting range: **1.5-2.0**
- `--output`: Output directory (saves as `[OUTLIERS_CONFIG_PATH]/outlier_config_o{percentage}.json`)

### Output

Configuration file: `[OUTLIERS_CONFIG_PATH]/outlier_config_o0.45.json` (example for 0.45% outliers)

You will need to fine-tune `--range` to achieve the desired outlier percentage.

## Step 4: K-means Clustering (Non-uniform Quantization LUT)

Performs K-means clustering to generate the non-uniform quantization look-up table (LUT).

### Dense-Only Quantization

```bash
python nuq.py --bit 4 --model_type llama --model [MODEL_CHUNKS_PATH] --gradient [GRADIENT_CHUNKS_PATH] --output [LUT_PATH]
```

### Dense-and-Sparse Quantization

If using D+S quantization with 0.45% outliers and 0.05% sensitive values:

```bash
python nuq.py --bit 4 --model_type llama --model [MODEL_CHUNKS_PATH] --gradient [GRADIENT_CHUNKS_PATH] --output [LUT_PATH] --outlier_config [OUTLIERS_CONFIG_PATH]/outlier_config_o0.45.json --sensitivity 0.05
```

### Arguments

- `--bit`: Quantization bitwidth (3 or 4)
- `--model`: Path to chunked model weights
- `--gradient`: Path to chunked gradients
- `--output`: Output directory for the LUT
- `--range`: (Optional) Quantize a specific layer range, e.g., `0,10` for layers 0-9
- `--outlier_config`: (D+S only) Path to the outlier config from Step 3
- `--sensitivity`: (D+S only) Percentage of sensitive values to extract (e.g., `0.05` for 0.05%)

### Performance Note

This step is **highly CPU-intensive**. Run it on a machine with:

- Many CPU cores
- Strong per-core CPU performance
- Sufficient RAM for model loading

### Output

LUT entries saved in: `[LUT_PATH]/lut`
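### Conceptual Sketch: Sensitivity-Weighted K-means

The core of this step is a weighted k-means: a layer's weight values are clustered into 2^bits centroids, with each value weighted by its Fisher sensitivity so that sensitive weights pull the centroids toward themselves. The sketch below illustrates that idea with scikit-learn (already a prerequisite above) on a single placeholder weight vector; `nuq.py` should be used for the actual LUT generation over the chunked checkpoints.

```python
# Conceptual sketch of sensitivity-weighted k-means for one weight vector:
# cluster weight values into 2**bits centroids, weighting each value by its
# Fisher score. nuq.py performs this over the chunked model for the real LUT.
import numpy as np
from sklearn.cluster import KMeans

bits = 4
rng = np.random.default_rng(0)
weights = rng.normal(size=(4096,)).astype(np.float32)   # placeholder layer weights
fisher = rng.random(size=(4096,)).astype(np.float32)    # placeholder sensitivity scores

kmeans = KMeans(n_clusters=2**bits, n_init=10, random_state=0)
kmeans.fit(weights.reshape(-1, 1), sample_weight=fisher)

lut = np.sort(kmeans.cluster_centers_.flatten())        # 16 non-uniform levels
codes = kmeans.predict(weights.reshape(-1, 1))          # per-weight 4-bit indices
dequantized = kmeans.cluster_centers_.flatten()[codes]  # reconstruction from the LUT

print("LUT:", lut)
```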
## Step 5: Packing

Saves the quantized model in packed format using the LUT from Step 4.

### Dense-Only Packing

```bash
python pack.py --model [MODEL_PATH] --wbits 4 --folder [LUT_PATH] --save [PACKED_CKPT_PATH]
```

### Dense-and-Sparse Packing

For D+S quantization (with sparse components):

```bash
python pack.py --model [MODEL_PATH] --wbits 4 --folder [LUT_PATH] --save [PACKED_CKPT_PATH] --include_sparse --balance
```

### Arguments

- `--model`: Original model checkpoint path
- `--wbits`: Quantization bitwidth (should match Step 4)
- `--folder`: Path to the LUT directory from Step 4
- `--save`: Output path for the packed checkpoint
- `--include_sparse`: (D+S only) Include sparse components
- `--balance`: (D+S only) Balance sparse weight distribution

### Output

Packed checkpoint: `[PACKED_CKPT_PATH]` - ready for immediate use in inference!

## Complete Example Workflow

### Dense-Only Quantization (3-bit)

```bash
# 1. Prepare gradient checkpoint using SqueezeLLM-gradients
# (produces [GRADIENT_PATH])

# 2. Chunk models
python chunk_models.py --model /path/to/llama-7b --output ./llama7b_chunks --model_type llama
python chunk_models.py --model /path/to/gradients --output ./llama7b_grad_chunks --model_type llama

# 3. K-means clustering
python nuq.py --bit 3 --model_type llama --model ./llama7b_chunks --gradient ./llama7b_grad_chunks --output ./llama7b_lut

# 4. Pack model
python pack.py --model /path/to/llama-7b --wbits 3 --folder ./llama7b_lut --save ./sq-llama-7b-w3-s0.pt
```

### Dense-and-Sparse Quantization (4-bit, 0.45% outliers, 0.05% sensitive)

```bash
# 1-2. Same as above...

# 3. Generate outlier config
python generate_outlier_config.py --model ./llama7b_chunks --range 1.8 --output ./llama7b_outliers

# 4. K-means clustering with outliers
python nuq.py --bit 4 --model_type llama --model ./llama7b_chunks --gradient ./llama7b_grad_chunks --output ./llama7b_lut_ds --outlier_config ./llama7b_outliers/outlier_config_o0.45.json --sensitivity 0.05

# 5. Pack with sparse components
python pack.py --model /path/to/llama-7b --wbits 4 --folder ./llama7b_lut_ds --save ./sq-llama-7b-w4-s45.pt --include_sparse --balance
```

## Supported Model Types

- `llama`: LLaMA (all versions)
- `llama2`: LLaMA-2
- `mistral`: Mistral
- `vicuna`: Vicuna
- `xgen`: XGen
- `opt`: OPT

## Key Concepts

### Fisher Information Matrix

Measures the sensitivity of model outputs to changes in weights. Weights with high Fisher scores are more critical to model performance.

### Non-Uniform Quantization

Instead of uniform quantization (fixed step sizes), SqueezeLLM uses K-means to find optimal per-layer quantization levels based on Fisher sensitivity.

### Dense-and-Sparse Quantization

Splits weights into:

- **Dense**: Aggressively quantized (3-4 bits)
- **Sparse**: Full precision, containing outliers and sensitive values (~0.45% + 0.05%)

This achieves high compression while maintaining accuracy.
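To visualize the dense/sparse split, the sketch below pulls the largest-magnitude entries of a weight matrix out into a sparse tensor and leaves the rest as the dense component to be quantized. The weight matrix and the 0.45% cutoff are placeholders for illustration only; the packed checkpoints produced by `pack.py --include_sparse` store this decomposition in their own format.

```python
# Conceptual sketch of the dense-and-sparse split: keep the top ~0.45% of
# weights by magnitude in full precision (sparse), quantize the rest (dense).
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)     # placeholder weight matrix
outlier_frac = 0.0045           # ~0.45% outliers

k = int(outlier_frac * W.numel())
threshold = W.abs().flatten().kthvalue(W.numel() - k).values  # magnitude cutoff
mask = W.abs() > threshold

sparse_part = (W * mask).to_sparse()  # full-precision outliers
dense_part = W * (~mask)              # remainder, to be quantized via the LUT

print("outliers kept:", mask.sum().item(), "of", W.numel())
# At inference time, W is approximated as dequantize(dense_part) + sparse_part.
```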
## Performance Expectations

- **3-bit Dense-only**: ~1-2% accuracy drop on benchmarks
- **4-bit Dense-only**: <1% accuracy drop
- **4-bit D+S (0.45% + 0.05%)**: ~0% accuracy drop vs. FP16

See the [research paper](https://arxiv.org/abs/2306.07629) for detailed results.

---

# Quick Start Guide

## Source: https://github.com/SqueezeAILab/SqueezeLLM/blob/main/README.md

## Download Pre-Quantized Models

SqueezeLLM provides pre-quantized models on Hugging Face Hub. Models are available for:

- **Bitwidth**: 3-bit and 4-bit
- **Sparsity**: 0% (dense-only), 0.05%, and 0.45%
- **Model families**: LLaMA, LLaMA-2, Mistral, Vicuna, XGen, OPT

All models are available at: https://huggingface.co/squeeze-ai-lab

Example model naming scheme:

- `sq-llama-7b-w3-s0`: LLaMA-7B, 3-bit, dense-only
- `sq-llama-7b-w4-s45`: LLaMA-7B, 4-bit, 0.45% sparsity

## Running Inference

### Basic Inference Example

```bash
# Download quantized model
# Example: sq-llama-7b-w3-s0.pt

# Set up environment
export CUDA_VISIBLE_DEVICES=0
export MODEL_PATH=/path/to/base/model  # Required for LLaMA v1 and Vicuna v1.1
export CKPT_PATH=/path/to/sq-llama-7b-w3-s0.pt

# Run inference
python llama.py $MODEL_PATH c4 --wbits 3 --load $CKPT_PATH --benchmark 128
```

### Benchmarking

#### LLaMA Benchmarking

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check --torch_profile
```

For models with sparsity:

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --benchmark 128 --check --torch_profile
```

#### XGen Benchmarking

```bash
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --benchmark 128 --check --torch_profile
```

### Perplexity Evaluation

Evaluate model perplexity on the C4 dataset:

```bash
# Dense-only models
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval

# Sparse models
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --eval
```

## Command-Line Options

### Common Arguments

- `--wbits`: Quantization bitwidth (3 or 4)
- `--load`: Path to quantized checkpoint
- `--benchmark`: Run benchmarking with the specified sequence length
- `--eval`: Evaluate perplexity
- `--check`: Verify quantized model outputs
- `--torch_profile`: Enable PyTorch profiling for runtime analysis
- `--include_sparse`: Include sparse components (for D+S quantized models)

## GPU Requirements

### Minimum Requirements

- 8GB VRAM for 7B models
- 16GB VRAM for 13B models

### Recommended Setup

- A5000 or A6000 GPU
- CUDA 11.3+
- cuDNN 8.2+

## Integration with vLLM

SqueezeLLM is supported in the official vLLM framework for efficient serving: https://github.com/vllm-project/vllm

See the vLLM documentation for integration details.

## Next Steps

- For custom model quantization, see the [From-Scratch Quantization Guide](quantization-guide.md)
- For detailed model configurations, see the [Model Zoo](model-zoo.md)
- For implementation details, see the [Research Paper](https://arxiv.org/abs/2306.07629)