# Bitsandbytes

> bitsandbytes enables accessible large language models via k-bit quantization for PyTorch.

## Pages

- [AdaGrad](adagrad.md): [AdaGrad (Adaptive Gradient)](https://jmlr.org/papers/v12/duchi11a.html) is an adaptive learning rate optimizer. AdaG...
- [Adam](adam.md): [Adam (Adaptive moment estimation)](https://hf.co/papers/1412.6980) is an adaptive learning rate optimizer, combining...
- [AdamW](adamw.md): [AdamW](https://hf.co/papers/1711.05101) is a variant of the `Adam` optimizer that separates weight decay from the gr...
- [AdEMAMix](ademamix.md): [AdEMAMix](https://hf.co/papers/2409.03137) is a variant of the `Adam` optimizer.
- [Contribution Guide](contributing.md): Install pre-commit hooks with `pip install pre-commit`.
- [Embedding](embeddings.md): The embedding class is used to store and retrieve word embeddings from their indices. There are two types of embeddin...
- [Troubleshoot](errors.md): This problem arises when the CUDA version loaded by bitsandbytes is not supported by your GPU, or if your PyTorch CUDA...
- [FAQs](faqs.md): Please submit your questions in [this GitHub Discussion thread](https://github.com/bitsandbytes-foundation/bitsandbyt...
- [FSDP-QLoRA](fsdp-qlora.md): FSDP-QLoRA combines data parallelism (FSDP enables sharding model parameters, optimizer states, and gradients across ...
- [Overview](functional.md): The `bitsandbytes.functional` API provides the low-level building blocks for the library's features.
- [bitsandbytes](index.md): bitsandbytes enables accessible large language models via k-bit quantization for PyTorch. bitsandbytes provides three...
- [Installation Guide](installation.md): Welcome to the installation guide for the `bitsandbytes` library! This document provides step-by-step instructions to...
- [Integrations](integrations.md): bitsandbytes is widely integrated with many of the libraries in the Hugging Face and wider PyTorch ecosystem. This gu...
- [LAMB](lamb.md): [LAMB (Layerwise adaptive large batch optimization)](https://hf.co/papers/1904.00962) is an adaptive optimizer design...
- [LARS](lars.md): [LARS (Layer-wise Adaptive Rate Scaling)](https://hf.co/papers/1708.03888) is an optimizer designed for training with ...
- [4-bit quantization](linear4bit.md): [QLoRA](https://hf.co/papers/2305.14314) is a finetuning method that quantizes a model to 4-bits and adds a set of lo...
- [LLM.int8()](linear8bit.md): [LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inferen...
- [Lion](lion.md): [Lion (Evolved Sign Momentum)](https://hf.co/papers/2302.06675) is a unique optimizer that uses the sign of the gradi...
- [Overview](optim-overview.md): [8-bit optimizers](https://hf.co/papers/2110.02861) reduce the memory footprint of 32-bit optimizers without any perf...
- [8-bit optimizers](optimizers.md): With 8-bit optimizers, large models can be finetuned with 75% less GPU memory without losing any accuracy compared to...
- [Quickstart](quickstart.md): ... work in progress ...
- [Papers, related resources & how to cite](resources.md): The academic work below is ordered in reverse chronological order.
- [RMSprop](rmsprop.md): RMSprop is an adaptive learning rate optimizer that is very similar to `Adagrad`. RMSprop stores a *weighted average*...
- [SGD](sgd.md): Stochastic gradient descent (SGD) is a basic gradient descent optimizer to minimize loss given a set of model paramet...
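
The linked pages cover these features in detail. As a quick orientation only, the sketch below shows the kind of usage the 8-bit optimizer pages describe: swapping a standard PyTorch optimizer for `bnb.optim.Adam8bit`. It is a minimal illustration, not taken from the pages above; it assumes a CUDA-capable GPU and a working `bitsandbytes` install, and the model and hyperparameters are placeholders.

```python
import torch
import bitsandbytes as bnb

# Toy model; any PyTorch module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).cuda()

# 8-bit Adam stores optimizer state in 8 bits instead of 32,
# reducing optimizer memory (see optim-overview.md and optimizers.md).
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# One dummy training step to show the drop-in usage.
x = torch.randn(8, 512, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```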