# Bitsandbytes

> bitsandbytes enables accessible large language models via k-bit quantization for PyTorch.

## Pages

- [AdaGrad](adagrad.md): [AdaGrad (Adaptive Gradient)](https://jmlr.org/papers/v12/duchi11a.html) is an adaptive learning rate optimizer. AdaG...
- [Adam](adam.md): [Adam (Adaptive moment estimation)](https://hf.co/papers/1412.6980) is an adaptive learning rate optimizer, combining...
- [AdamW](adamw.md): [AdamW](https://hf.co/papers/1711.05101) is a variant of the `Adam` optimizer that separates weight decay from the gr...
- [AdEMAMix](ademamix.md): [AdEMAMix](https://hf.co/papers/2409.03137) is a variant of the `Adam` optimizer.
- [Contribution Guide](contributing.md): Install pre-commit hooks with `pip install pre-commit`.
- [Embedding](embeddings.md): The embedding class is used to store and retrieve word embeddings from their indices. There are two types of embeddin...
- [Troubleshoot](errors.md): This problem arises when the CUDA version loaded by bitsandbytes is not supported by your GPU, or if your PyTorch CUDA...
- [FAQs](faqs.md): Please submit your questions in [this GitHub Discussion thread](https://github.com/bitsandbytes-foundation/bitsandbyt...
- [FSDP-QLoRA](fsdp-qlora.md): FSDP-QLoRA combines data parallelism (FSDP enables sharding model parameters, optimizer states, and gradients across ...
- [Overview](functional.md): The `bitsandbytes.functional` API provides the low-level building blocks for the library's features.
- [bitsandbytes](index.md): bitsandbytes enables accessible large language models via k-bit quantization for PyTorch. bitsandbytes provides three...
- [Installation Guide](installation.md): Welcome to the installation guide for the `bitsandbytes` library! This document provides step-by-step instructions to...
- [Integrations](integrations.md): bitsandbytes is widely integrated with many of the libraries in the Hugging Face and wider PyTorch ecosystem. This gu...
- [LAMB](lamb.md): [LAMB (Layerwise adaptive large batch optimization)](https://hf.co/papers/1904.00962) is an adaptive optimizer design...
- [LARS](lars.md): [LARS (Layer-wise Adaptive Rate Scaling)](https://hf.co/papers/1708.03888) is an optimizer designed for training with ...
- [4-bit quantization](linear4bit.md): [QLoRA](https://hf.co/papers/2305.14314) is a finetuning method that quantizes a model to 4-bits and adds a set of lo...
- [LLM.int8()](linear8bit.md): [LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inferen...
- [Lion](lion.md): [Lion (Evolved Sign Momentum)](https://hf.co/papers/2302.06675) is a unique optimizer that uses the sign of the gradi...
- [Overview](optim-overview.md): [8-bit optimizers](https://hf.co/papers/2110.02861) reduce the memory footprint of 32-bit optimizers without any perf...
- [8-bit optimizers](optimizers.md): With 8-bit optimizers, large models can be finetuned with 75% less GPU memory without losing any accuracy compared to...
- [Quickstart](quickstart.md): ... work in progress ...
- [Papers, related resources & how to cite](resources.md): The academic work below is ordered in reverse chronological order.
- [RMSprop](rmsprop.md): RMSprop is an adaptive learning rate optimizer that is very similar to `Adagrad`. RMSprop stores a *weighted average*...
- [SGD](sgd.md): Stochastic gradient descent (SGD) is a basic gradient descent optimizer to minimize loss given a set of model paramet...
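
The linked pages cover these features in detail. As a quick orientation only, the sketch below shows the kind of usage the 8-bit optimizer pages describe: swapping a standard PyTorch optimizer for `bnb.optim.Adam8bit`. It is a minimal illustration, not taken from the pages above; it assumes a CUDA-capable GPU and a working `bitsandbytes` install, and the model and hyperparameters are placeholders.

```python
import torch
import bitsandbytes as bnb

# Toy model; any PyTorch module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).cuda()

# 8-bit Adam stores optimizer state in 8 bits instead of 32,
# reducing optimizer memory (see optim-overview.md and optimizers.md).
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# One dummy training step to show the drop-in usage.
x = torch.randn(8, 512, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```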