# CLI arguments

To see all options to serve your models, run the following:

```console
$ text-embeddings-router --help
Text Embedding Webserver

Usage: text-embeddings-router [OPTIONS] --model-id

Options:
      --model-id
          The Hugging Face model ID, can be any model listed on the Hugging Face Hub with the `text-embeddings-inference` tag (meaning it's compatible with Text Embeddings Inference). Alternatively, the specified ID can also be a path to a local directory containing the necessary model files saved by the `save_pretrained(...)` methods of either Transformers or Sentence Transformers.
          [env: MODEL_ID=]

      --revision
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
          [env: REVISION=]

      --tokenization-workers
          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Defaults to the number of CPU cores on the machine
          [env: TOKENIZATION_WORKERS=]

      --dtype
          The dtype to be forced upon the model
          [env: DTYPE=] [possible values: float16, float32]

      --served-model-name
          The name of the model that is being served. If not specified, defaults to `--model-id`. It is only used for the OpenAI-compatible endpoints via HTTP
          [env: SERVED_MODEL_NAME=]

      --pooling
          Optionally control the pooling method for embedding models. If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json` configuration. If `pooling` is set, it will override the model pooling configuration
          [env: POOLING=]
          Possible values:
          - cls:        Select the CLS token as embedding
          - mean:       Apply Mean pooling to the model embeddings
          - splade:     Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
          - last-token: Select the last token as embedding

      --max-concurrent-requests
          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse client requests instead of having them wait for too long and is usually good to handle backpressure correctly
          [env: MAX_CONCURRENT_REQUESTS=] [default: 512]

      --max-batch-tokens
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware. This represents the total amount of potential tokens within a batch. For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens. Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
          [env: MAX_BATCH_TOKENS=] [default: 16384]

      --max-batch-requests
          Optionally control the maximum number of individual requests in a batch
          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size
          Control the maximum number of inputs that a client can send in a single request
          [env: MAX_CLIENT_BATCH_SIZE=] [default: 32]

      --auto-truncate
          Automatically truncate inputs that are longer than the maximum supported size. Unused for gRPC servers
          [env: AUTO_TRUNCATE=]

      --default-prompt-name
          The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied. Must be a key in the `sentence-transformers` configuration `prompts` dictionary. For example if `default_prompt_name` is "query" and the `prompts` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode. The argument '--default-prompt-name' cannot be used with '--default-prompt'
          [env: DEFAULT_PROMPT_NAME=]

      --default-prompt
          The prompt that should be used by default for encoding. If not set, no prompt will be applied. For example if `default_prompt` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode. The argument '--default-prompt' cannot be used with '--default-prompt-name'
          [env: DEFAULT_PROMPT=]

      --dense-path
          Optionally, define the path to the Dense module required for some embedding models. Some embedding models require an extra `Dense` module which contains a single Linear layer and an activation function. By default, those `Dense` modules are stored under the `2_Dense` directory, but there might be cases where different `Dense` modules are provided, to convert the pooled embeddings into different dimensions, available as `2_Dense_` e.g. https://huggingface.co/NovaSearch/stella_en_400M_v5. Note that this argument is optional, only required to be set if there is no `modules.json` file or when you want to override a single Dense module path, only when running with the `candle` backend.
          [env: DENSE_PATH=]

      --hf-token
          Your Hugging Face Hub token. If neither `--hf-token` nor `HF_TOKEN` is set, the token will be read from the `$HF_HOME/token` path, if it exists. This ensures access to private or gated models, and allows for a more permissive rate limiting
          [env: HF_TOKEN=]

      --hostname
          The IP address to listen on
          [env: HOSTNAME=] [default: 0.0.0.0]

  -p, --port
          The port to listen on
          [env: PORT=] [default: 3000]

      --uds-path
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC
          [env: UDS_PATH=] [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
          [env: HUGGINGFACE_HUB_CACHE=]

      --payload-limit
          Payload size limit in bytes. Default is 2MB
          [env: PAYLOAD_LIMIT=] [default: 2000000]

      --api-key
          Set an api key for request authorization. By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.
          [env: API_KEY=]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)
          [env: JSON_OUTPUT=]

      --disable-spans
          Whether or not to include the log trace through spans
          [env: DISABLE_SPANS=]

      --otlp-endpoint
          The grpc endpoint for opentelemetry. Telemetry is sent to this endpoint as OTLP over gRPC. e.g. `http://localhost:4317`
          [env: OTLP_ENDPOINT=]

      --otlp-service-name
          The service name for opentelemetry. e.g. `text-embeddings-inference.server`
          [env: OTLP_SERVICE_NAME=] [default: text-embeddings-inference.server]

      --prometheus-port
          The Prometheus port to listen on
          [env: PROMETHEUS_PORT=] [default: 9000]

      --cors-allow-origin
          Unused for gRPC servers
          [env: CORS_ALLOW_ORIGIN=]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
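As an illustrative sketch (not a recommended configuration), a launch that overrides a few of these defaults could look like the following; the model ID and values are arbitrary and should be adapted to your hardware:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model \
    --dtype float16 \
    --max-batch-tokens 32768 \
    --max-client-batch-size 64 \
    --auto-truncate \
    --port 8080
```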
---

# Build a custom container for TEI

You can build your own CPU or CUDA TEI container using Docker. To build a CPU container, run the following command in the directory containing your custom Dockerfile:

```shell
docker build .
```

To build a CUDA container, it is essential to determine the compute capability (compute cap) of the GPU that will be used at runtime. This information is crucial for the proper configuration of the CUDA containers. The following are examples of runtime compute capabilities for various GPU types:

- Turing (T4, RTX 2000 series, ...) - `runtime_compute_cap=75`
- A100 - `runtime_compute_cap=80`
- A10 - `runtime_compute_cap=86`
- Ada Lovelace (RTX 4000 series, ...) - `runtime_compute_cap=89`
- H100 - `runtime_compute_cap=90`

Once the compute capability is determined, set it as the `runtime_compute_cap` variable and build the container as shown in the example below:

```shell
# Get submodule dependencies
git submodule update --init

runtime_compute_cap=80
docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```

---

# Example uses

- [Set up an Inference Endpoint with TEI](https://huggingface.co/learn/cookbook/automatic_embedding_tei_inference_endpoints)
- [RAG containers with TEI](https://github.com/plaggy/rag-containers)

---

# Text Embeddings Inference

Text Embeddings Inference (TEI) is a comprehensive toolkit designed for efficient deployment and serving of open source text embeddings models. It enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5.

TEI offers multiple features tailored to optimize the deployment process and enhance overall performance.

## Key Features

* **Streamlined Deployment:** TEI eliminates the need for a model graph compilation step for an easier deployment process.
* **Efficient Resource Utilization:** Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
* **Dynamic Batching:** TEI incorporates token-based dynamic batching, optimizing resource utilization during inference.
* **Optimized Inference:** TEI leverages [Flash Attention](https://github.com/HazyResearch/flash-attention), [Candle](https://github.com/huggingface/candle), and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) by using optimized transformers code for inference.
* **Safetensors weight loading:** TEI loads [Safetensors](https://github.com/huggingface/safetensors) weights for faster boot times.
* **Production-Ready:** TEI supports distributed tracing through Open Telemetry and exports Prometheus metrics.

## Benchmarks

Benchmark for [BAAI/bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) on an NVIDIA A10 with a sequence length of 512 tokens:

*Latency comparison for batch size of 1* · *Throughput comparison for batch size of 1*

*Latency comparison for batch size of 32* · *Throughput comparison for batch size of 32*

## Getting Started

To start using TEI, check the [Quick Tour](quick_tour) guide.

---

# Using TEI Container with Intel® Hardware

This guide explains how to build and deploy `text-embeddings-inference` containers optimized for Intel® hardware, including CPUs, XPUs, and HPUs.

## CPU

### Build Docker Image (CPU)

To build a container optimized for Intel® CPUs, run the following command:

```shell
platform="cpu"

docker build . -f Dockerfile-intel --build-arg PLATFORM=$platform -t tei_cpu_ipex
```

### Deploy Docker Container (CPU)

To deploy your model on an Intel® CPU, use the following command:

```shell
model='Qwen/Qwen3-Embedding-0.6B'
volume=$PWD/data

docker run -p 8080:80 -v $volume:/data tei_cpu_ipex --model-id $model
```

## XPU

### Build Docker Image (XPU)

To build a container optimized for Intel® XPUs, run the following command:

```shell
platform="xpu"

docker build . -f Dockerfile-intel --build-arg PLATFORM=$platform -t tei_xpu_ipex
```

### Deploy Docker Container (XPU)

To deploy your model on an Intel® XPU, use the following command:

```shell
model='Qwen/Qwen3-Embedding-0.6B'
volume=$PWD/data

docker run -p 8080:80 -v $volume:/data --device=/dev/dri -v /dev/dri/by-path:/dev/dri/by-path tei_xpu_ipex --model-id $model --dtype float16
```

## HPU

> [!WARNING]
> TEI is supported only on Gaudi 2 and Gaudi 3. Gaudi 1 is **not** supported.

### Build Docker Image (HPU)

To build a container optimized for Intel® HPUs (Gaudi), run the following command:

```shell
platform="hpu"

docker build . -f Dockerfile-intel --build-arg PLATFORM=$platform -t tei_hpu
```

### Deploy Docker Container (HPU)

To deploy your model on an Intel® HPU (Gaudi), use the following command:

```shell
model='Qwen/Qwen3-Embedding-0.6B'
volume=$PWD/data

docker run -p 8080:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e MAX_WARMUP_SEQUENCE_LENGTH=512 tei_hpu --model-id $model --dtype bfloat16
```

## Prebuilt Docker Images

For convenience, prebuilt Docker images are available on GitHub Container Registry (GHCR). You can pull these images directly without the need to build them manually:

### Prebuilt CPU

To use the prebuilt image optimized for Intel® CPUs, run:

```shell
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-ipex-latest
```

### Prebuilt XPU

To use the prebuilt image optimized for Intel® XPUs, run:

```shell
docker pull ghcr.io/huggingface/text-embeddings-inference:xpu-ipex-latest
```

### Prebuilt HPU

> [!WARNING]
> TEI is supported only on Gaudi 2 and Gaudi 3. Gaudi 1 is **not** supported.

To use the prebuilt image optimized for Intel® HPUs (Gaudi), run:

```shell
docker pull ghcr.io/huggingface/text-embeddings-inference:hpu-latest
```
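These prebuilt images are run the same way as the locally built ones; for instance, a minimal sketch using the prebuilt Intel® CPU image with the same model and port as in the build examples above:

```shell
model='Qwen/Qwen3-Embedding-0.6B'
volume=$PWD/data

docker run -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-embeddings-inference:cpu-ipex-latest --model-id $model
```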
---

# Using TEI locally with CPU

You can install `text-embeddings-inference` locally to run it on your own machine. Here are the step-by-step instructions for installation:

## Step 1: Install Rust

[Install Rust](https://rustup.rs/) on your machine by running the following in your terminal, then follow the on-screen instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Step 2: Install necessary packages

Depending on your machine's architecture, run one of the following commands:

### For x86 Machines

```shell
cargo install --path router -F mkl
```

### For M1 or M2 Machines

```shell
cargo install --path router -F metal
```

## Step 3: Launch Text Embeddings Inference

Once the installation is complete, you can launch Text Embeddings Inference on CPU with the following command:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080
```

In some cases, you might also need the OpenSSL libraries and gcc installed. On Linux machines, run the following command:

```shell
sudo apt-get install libssl-dev gcc -y
```

Now you are ready to use `text-embeddings-inference` locally on your machine. If you want to run TEI locally with a GPU, check out the [Using TEI locally with GPU](local_gpu) page.

---

# Using TEI locally with GPU

You can install `text-embeddings-inference` locally to run it on your own machine with a GPU. To make sure that your hardware is supported, check out the [Supported models and hardware](supported_models) page.

## Step 1: CUDA and NVIDIA drivers

Make sure you have CUDA and the NVIDIA drivers installed - NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher. Add the NVIDIA binaries to your path:

```shell
export PATH=$PATH:/usr/local/cuda/bin
```

## Step 2: Install Rust

[Install Rust](https://rustup.rs/) on your machine by running the following in your terminal, then follow the on-screen instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Step 3: Install necessary packages

This step can take a while as we need to compile a lot of CUDA kernels.

### For Turing GPUs (T4, RTX 2000 series ... )

```shell
cargo install --path router -F candle-cuda-turing
```

### For Ampere and Hopper

```shell
cargo install --path router -F candle-cuda
```

## Step 4: Launch Text Embeddings Inference

You can now launch Text Embeddings Inference on GPU with:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --dtype float16 --port 8080
```

---

# Using TEI locally with Metal

You can install `text-embeddings-inference` locally to run it on your own Mac with Metal support. Here are the step-by-step instructions for installation:

## Step 1: Install Rust

[Install Rust](https://rustup.rs/) on your machine by running the following in your terminal, then follow the on-screen instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Step 2: Install with Metal support

```shell
cargo install --path router -F metal
```

## Step 3: Launch Text Embeddings Inference

Once the installation is complete, you can launch Text Embeddings Inference with Metal with the following command:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080
```

Now you are ready to use `text-embeddings-inference` locally on your machine.
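As a quick sanity check that the local server is up, you can send a request to the `/embed` endpoint, mirroring the Quick Tour example (the input text is arbitrary):

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```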
---

# Serving private and gated models

If the model you wish to serve is behind gated access or resides in a private model repository on Hugging Face Hub, you will need to have access to the model to serve it.

Once you have confirmed that you have access to the model:

- Navigate to your account's [Profile | Settings | Access Tokens page](https://huggingface.co/settings/tokens).
- Generate and copy a read token.

If you're using the CLI, set the `HF_TOKEN` environment variable. For example:

```shell
export HF_TOKEN=
```

Alternatively, you can provide the token when deploying the model with Docker:

```shell
model=
volume=$PWD/data
token=

docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
```

---

# Quick Tour

## Set up

The easiest way to get started with TEI is to use one of the official Docker containers (see [Supported models and hardware](supported_models) to choose the right container). Hence, you need to install Docker following the [installation instructions](https://docs.docker.com/get-docker/).

TEI supports inference both on GPU and CPU. If you plan on using a GPU, make sure that your hardware is supported by checking [this table](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images). Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

## Deploy

Next it's time to deploy your model. Let's say you want to use [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B). Here's how you can do this:

```shell
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
```

We also recommend sharing a volume with the Docker container (`volume=$PWD/data`) to avoid downloading weights every run.

## Inference

Inference can be performed in 3 ways: using cURL, or via the `InferenceClient` or `OpenAI` Python SDKs.

### cURL

To send a POST request to the TEI endpoint using cURL, you can run the following command:

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```

### Python

To run inference using Python, you can either use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/en/index) Python SDK (recommended) or the `openai` Python SDK.
#### huggingface_hub

You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

embedding = client.feature_extraction("What is deep learning?", model="http://localhost:8080/embed")
print(len(embedding[0]))
```

#### OpenAI

To send requests to the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create) exposed on Text Embeddings Inference (TEI) with the OpenAI Python SDK, you can install it as `pip install --upgrade openai`, and then run the following snippet:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.embeddings.create(
    model="text-embeddings-inference",
    input="What is Deep Learning?",
)

print(response.data[0].embedding)
```

Alternatively, you can also send the request with cURL as follows:

```bash
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "input": "What is Deep Learning?",
        "model": "text-embeddings-inference",
        "encoding_format": "float"
    }'
```

## Re-rankers and sequence classification

TEI also supports re-ranker and classic sequence classification models.

### Re-rankers

Rerankers, also called cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large). First, you can deploy it like so:

```shell
model=BAAI/bge-reranker-large
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
```

Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With `cURL` this can be done like so:

```bash
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json'
```

### Sequence classification models

You can also use classic Sequence Classification models like [`SamLowe/roberta-base-go_emotions`](https://huggingface.co/SamLowe/roberta-base-go_emotions):

```shell
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
```

Once you have deployed the model, you can use the `predict` endpoint to get the emotions most associated with an input:

```bash
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```
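Note that if you launched the router with the `--api-key` option described in the CLI arguments, the same requests additionally need the key passed as a Bearer token; a sketch, where `$API_KEY` is a placeholder for whatever key you configured:

```bash
curl 127.0.0.1:8080/predict \
    -X POST \
    -H "Authorization: Bearer $API_KEY" \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```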
## Batching

You can send multiple inputs in a batch. For example, for embeddings:

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'
```

And for Sequence Classification:

```bash
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
    -H 'Content-Type: application/json'
```

## Air gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume. For example:

```shell
# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/gte-base-en-v1.5
```

---

# Supported models and hardware

We are continually expanding our support for other model types and plan to include them in future updates.

## Supported embeddings models

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions, JinaBERT models with ALiBi positions, Mistral, Alibaba GTE, and Qwen2 models with RoPE positions, as well as MPNet, ModernBERT, Qwen3, and Gemma3.

Below are some examples of the currently supported models:

| MTEB Rank | Model Size | Model Type | Model ID |
|-----------|------------------------|-------------|----------------------------------------------------------------------------------------------------|
| 2 | 7.57B (Very Expensive) | Qwen3 | [Qwen/Qwen3-Embedding-8B](https://hf.co/Qwen/Qwen3-Embedding-8B) |
| 3 | 4.02B (Very Expensive) | Qwen3 | [Qwen/Qwen3-Embedding-4B](https://hf.co/Qwen/Qwen3-Embedding-4B) |
| 4 | 509M | Qwen3 | [Qwen/Qwen3-Embedding-0.6B](https://hf.co/Qwen/Qwen3-Embedding-0.6B) |
| 6 | 7.61B (Very Expensive) | Qwen2 | [Alibaba-NLP/gte-Qwen2-7B-instruct](https://hf.co/Alibaba-NLP/gte-Qwen2-7B-instruct) |
| 7 | 560M | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
| 8 | 308M | Gemma3 | [google/embeddinggemma-300m](https://hf.co/google/embeddinggemma-300m) (gated) |
| 15 | 1.78B (Expensive) | Qwen2 | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://hf.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) |
| 18 | 7.11B (Very Expensive) | Mistral | [Salesforce/SFR-Embedding-2_R](https://hf.co/Salesforce/SFR-Embedding-2_R) |
| 35 | 568M | XLM-RoBERTa | [Snowflake/snowflake-arctic-embed-l-v2.0](https://hf.co/Snowflake/snowflake-arctic-embed-l-v2.0) |
| 41 | 305M | Alibaba GTE | [Snowflake/snowflake-arctic-embed-m-v2.0](https://hf.co/Snowflake/snowflake-arctic-embed-m-v2.0) |
| 52 | 335M | BERT | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1) |
| 58 | 137M | NomicBERT | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1) |
| 79 | 137M | NomicBERT | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5) |
| 103 | 109M | MPNet | [sentence-transformers/all-mpnet-base-v2](https://hf.co/sentence-transformers/all-mpnet-base-v2) |
| N/A | 475M-A305M | NomicBERT | [nomic-ai/nomic-embed-text-v2-moe](https://hf.co/nomic-ai/nomic-embed-text-v2-moe) |
| N/A | 434M | Alibaba GTE | [Alibaba-NLP/gte-large-en-v1.5](https://hf.co/Alibaba-NLP/gte-large-en-v1.5) |
| N/A | 396M | ModernBERT | [answerdotai/ModernBERT-large](https://hf.co/answerdotai/ModernBERT-large) |
| N/A | 137M | JinaBERT | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en) |
| N/A | 137M | JinaBERT | [jinaai/jina-embeddings-v2-base-code](https://hf.co/jinaai/jina-embeddings-v2-base-code) |

To explore the list of best performing text embeddings models, visit the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

## Supported re-rankers and sequence classification models

Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions. Below are some examples of the currently supported models:

| Task | Model Type | Model ID |
|--------------------|-------------|-----------------------------------------------------------------------------------------------------------------|
| Re-Ranking | XLM-RoBERTa | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) |
| Re-Ranking | XLM-RoBERTa | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) |
| Re-Ranking | GTE | [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base) |
| Re-Ranking | ModernBert | [Alibaba-NLP/gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) |
| Sentiment Analysis | RoBERTa | [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) |

## Supported hardware

Text Embeddings Inference can be used on CPU, Turing (T4, RTX 2000 series, ...), Ampere 80 (A100, A30), Ampere 86 (A10, A40, ...), Ada Lovelace (RTX 4000 series, ...), and Hopper (H100) architectures.

The library does **not** support CUDA compute capabilities < 7.5, which means V100, Titan V, GTX 1000 series, etc. are not supported.

To leverage your GPUs, make sure to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and use NVIDIA drivers with CUDA version 12.2 or higher.

Find the appropriate Docker image for your hardware in the following table:

| Architecture | Image |
|-------------------------------------|--------------------------------------------------------------------------|
| CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8 |
| Volta | NOT SUPPORTED |
| Turing (T4, RTX 2000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:turing-1.8 (experimental) |
| Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:1.8 |
| Ampere 86 (A10, A40, ...) | ghcr.io/huggingface/text-embeddings-inference:86-1.8 |
| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.8 |
| Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-1.8 (experimental) |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues. You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
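For instance, a sketch of a Turing deployment with Flash Attention v1 enabled could look as follows; the model is just an example taken from the tables above, and you should validate the numerical behavior for your own model:

```shell
model=WhereIsAI/UAE-Large-V1
volume=$PWD/data

docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.8 --model-id $model
```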
---

# Deploying TEI on Google Cloud Run

Deploying Text Embeddings Inference (TEI) on Google Cloud Platform (GCP) allows you to benefit from the underlying [Kubernetes](https://kubernetes.io/) technology, which ensures that TEI can scale automatically up or down based on demand.

On Google Cloud, there are 3 main options for deploying TEI (or any other Docker container):

- Cloud Run
- Vertex AI endpoints
- GKE (Google Kubernetes Engine)

This guide explains how to deploy TEI on Cloud Run, a fully managed service by Google. Cloud Run is a so-called serverless offering: the server infrastructure is handled by Google, and you only need to provide a Docker container. The benefit of this is that you only pay for compute when there is demand for your application. Cloud Run will automatically spin up servers when there's demand, and scale down to zero when there is no demand.

We will showcase how to deploy any text embedding model with or without a GPU.

> [!NOTE]
> At the time of writing, GPU support on Cloud Run is generally available in 4 regions. If you're interested in using it, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. So far, NVIDIA L4 GPUs (24GiB VRAM) are the only GPUs available on Cloud Run, with automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.

## Setup / Configuration

This guide already assumes you have set up a Google Cloud project and have enabled billing. This can be done at https://console.cloud.google.com/.

First, you need to install the [gcloud CLI](https://cloud.google.com/sdk/docs/install) on your local machine. This allows you to programmatically interact with your Google Cloud project.

To ease the usage of the commands within this tutorial, set the following environment variables for GCP:

```bash
export PROJECT_ID=your-project-id
export LOCATION=europe-west1  # or any location you prefer: https://cloud.google.com/run/docs/locations
export CONTAINER_URI="gcr.io/deeplearning-platform-release/huggingface-text-embeddings-inference-cpu.1-6"
export SERVICE_NAME="text-embedding-server"  # choose a name for your service
export MODEL_ID="ibm-granite/granite-embedding-278m-multilingual"  # choose any embedding model
```

Some clarifications:

- We provide the latest official Docker image URI based on the [README](https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/tei/README.md).
- We choose to deploy the [IBM granite](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual) embedding model given its strong multilingual capabilities. You can of course choose any other embedding model from the Hub; it's recommended to look for models tagged with either `feature-extraction`, `sentence-similarity` or `text-ranking`.

Then you need to log in to your Google Cloud account and set the project ID you want to use to deploy Cloud Run.

```bash
gcloud auth login
gcloud auth application-default login  # For local development
gcloud config set project $PROJECT_ID
```

Once you are logged in, you need to enable the Cloud Run API, which is required for the Hugging Face DLC for TEI deployment on Cloud Run.

```bash
gcloud services enable run.googleapis.com
```

## Deploy TEI on Cloud Run

Once you are all set, you can call the `gcloud run deploy` command to deploy the Docker image. The command needs you to specify the following parameters:

- `--image`: The container image URI to deploy.
- `--args`: The arguments to pass to the container entrypoint, being `text-embeddings-inference` for the Hugging Face DLC for TEI. Read more about the supported arguments [here](https://huggingface.co/docs/text-embeddings-inference/cli_arguments).
  - `--model-id`: The model ID to use, in this case, [`ibm-granite/granite-embedding-278m-multilingual`](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual).
  - `--quantize`: The quantization method to use. If not specified, it will be retrieved from the `quantization_config->quant_method` in the `config.json` file.
  - `--max-concurrent-requests`: The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse client requests instead of having them wait for too long, and is usually good to handle backpressure correctly. Set to 64 here, while the TEI default is 512.
- `--port`: The port the container listens to.
- `--cpu` and `--memory`: The number of CPUs and amount of memory to allocate to the container. Needs to be set to 4 and 16Gi (16 GiB), respectively, as that's the minimum requirement for using the GPU.
- `--no-cpu-throttling`: Disables CPU throttling, which is required for using the GPU.
- `--gpu` and `--gpu-type`: The number of GPUs and the GPU type to use. Needs to be set to 1 and `nvidia-l4`, respectively, as at the time of writing this tutorial those are the only options available for GPUs on Cloud Run.
- `--max-instances`: The maximum number of instances to run, set to 3, but the default maximum value is 7. Alternatively, one could set it to 1 too, but that could eventually lead to downtime during infrastructure migrations, so anything above 1 is recommended.
- `--concurrency`: The maximum number of concurrent requests per instance, set to 64. Note that this value is also aligned with the [`--max-concurrent-requests`](https://huggingface.co/docs/text-embeddings-inference/cli_arguments) argument in TEI.
- `--region`: The region to deploy the Cloud Run service.
- `--no-allow-unauthenticated`: Disables unauthenticated access to the service, which is a good practice as it adds an authentication layer managed by Google Cloud IAM.

> [!NOTE]
> Optionally, you can include the arguments `--vpc-egress=all-traffic` and `--subnet=default`, as there is external traffic being sent to the public internet; so, in order to speed up networking, you need to route all traffic through the VPC network by setting those flags. Note that besides setting the flags, you need to set up Google Cloud NAT to reach the public internet, which is a paid product. Find more information in [Cloud Run Documentation - Networking best practices](https://cloud.google.com/run/docs/configuring/networking-best-practices).
>
> ```bash
> gcloud compute routers create nat-router --network=default --region=$LOCATION
> gcloud compute routers nats create vm-nat --router=nat-router --region=$LOCATION --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
> ```

Finally, you can run the `gcloud run deploy` command to deploy TEI on Cloud Run as:

```bash
gcloud run deploy $SERVICE_NAME \
    --image=$CONTAINER_URI \
    --args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
    --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
    --port=8080 \
    --cpu=8 \
    --memory=32Gi \
    --region=$LOCATION \
    --no-allow-unauthenticated
```

If you want to deploy with a GPU, run the following command:

```bash
gcloud run deploy $SERVICE_NAME \
    --image=$CONTAINER_URI \
    --args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
    --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
    --port=8080 \
    --cpu=8 \
    --memory=32Gi \
    --no-cpu-throttling \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --max-instances=3 \
    --concurrency=64 \
    --region=$LOCATION \
    --no-allow-unauthenticated
```

Or as follows if you created the Cloud NAT:

```bash
gcloud beta run deploy $SERVICE_NAME \
    --image=$CONTAINER_URI \
    --args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
    --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
    --port=8080 \
    --cpu=8 \
    --memory=32Gi \
    --no-cpu-throttling \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --max-instances=3 \
    --concurrency=64 \
    --region=$LOCATION \
    --no-allow-unauthenticated \
    --vpc-egress=all-traffic \
    --subnet=default
```

> [!NOTE]
> The first time you deploy a new container on Cloud Run it will take around 5 minutes, as it needs to be imported from the Google Cloud Artifact Registry, but on follow-up deployments it will take less time as the image has already been imported.

## Inference

Once deployed, you can send requests to the service via any of the supported TEI endpoints; check TEI's [OpenAPI Specification](https://huggingface.github.io/text-embeddings-inference/) to see all the available endpoints and their respective parameters.

All Cloud Run services are deployed privately by default, which means that they can't be accessed without providing authentication credentials in the request headers. These services are secured by IAM and are only callable by Project Owners, Project Editors, Cloud Run Admins, and Cloud Run Invokers.

In this case, a couple of alternatives to enable developer access will be showcased, while the other use cases are out of the scope of this example, as those are either not secure due to the authentication being disabled (for public access scenarios), or require additional setup for production-ready scenarios (service-to-service authentication, end-user access).

> [!NOTE]
> The alternatives mentioned below are for development scenarios, and should not be used in production-ready scenarios as is. The approach below follows the guide defined in [Cloud Run Documentation - Authenticate Developers](https://cloud.google.com/run/docs/authenticating/developers); but you can find every other guide as mentioned above in [Cloud Run Documentation - Authentication overview](https://cloud.google.com/run/docs/authenticating/overview).

### Via Cloud Run Proxy

Cloud Run Proxy runs a server on localhost that proxies requests to the specified Cloud Run Service with credentials attached, which is useful for testing and experimentation.
```bash
gcloud run services proxy $SERVICE_NAME --region $LOCATION
```

Then you can send requests to the deployed service on Cloud Run using the http://localhost:8080 URL, with no authentication, exposed by the proxy as shown in the examples below. You can check the API docs at http://localhost:8080/docs in your browser.

#### cURL

To send a POST request to the TEI service using cURL, you can run the following command:

```bash
curl http://localhost:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is deep learning?"
    }'
```

Alternatively, one can also send requests to the OpenAI-compatible endpoint:

```bash
curl http://localhost:8080/v1/embeddings \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tei",
        "input": "What is deep learning?"
    }'
```

#### Python

To run inference using Python, you can either use the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/en/index) Python SDK (recommended) or the openai Python SDK.

##### huggingface_hub

You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

embedding = client.feature_extraction("What is deep learning?", model="http://localhost:8080/embed")
print(len(embedding[0]))
```

##### OpenAI

You can install it via pip as `pip install --upgrade openai`, and then run:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="")

response = client.embeddings.create(
    model="tei",
    input="What is deep learning?"
)

print(response)
```

### (recommended) Via Cloud Run Service URL

Each Cloud Run Service has a unique URL assigned that can be used to send requests from anywhere, using Google Cloud Credentials with Cloud Run Invoke access to the service; this is the recommended approach as it's more secure and consistent than using the Cloud Run Proxy.

The URL of the Cloud Run service can be obtained via the following command (assigned to the SERVICE_URL variable for convenience):

```bash
SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $LOCATION --format 'value(status.url)')
```

Then you can send requests to the deployed service on Cloud Run, using the `SERVICE_URL` and any Google Cloud Credentials with Cloud Run Invoke access. For setting up the credentials, there are multiple approaches; some of those are listed below:

- Using the default identity token from the Google Cloud SDK:
  - Via gcloud as:

    ```bash
    gcloud auth print-identity-token
    ```

  - Via Python as:

    ```python
    import google.auth
    from google.auth.transport.requests import Request as GoogleAuthRequest

    auth_req = GoogleAuthRequest()
    creds, _ = google.auth.default()
    creds.refresh(auth_req)

    id_token = creds.id_token
    ```

- Using a Service Account with Cloud Run Invoke access, which can either be done with any of the following approaches:
  - Create a Service Account before the Cloud Run Service was created, and then set the `service-account` flag to the Service Account email when creating the Cloud Run Service; then use an Access Token for that Service Account only, using `gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT_EMAIL`.
  - Create a Service Account after the Cloud Run Service was created, and then update the Cloud Run Service to use the Service Account (see the sketch right after this list); then use an Access Token for that Service Account only, using `gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT_EMAIL`.
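As a sketch of that second option, attaching an existing Service Account to an already-deployed service could look as follows; here `$SERVICE_ACCOUNT_NAME` refers to a Service Account such as the one created in the steps below, and the exact flags should be verified against the gcloud documentation:

```bash
# Illustrative only: make the already-deployed Cloud Run Service run as the given Service Account
gcloud run services update $SERVICE_NAME \
    --service-account=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --region=$LOCATION
```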
The recommended approach is to use a Service Account (SA), as the access can be controlled better and the permissions are more granular. Since the Cloud Run Service was not created using a SA (which is another nice option), you now need to create the SA, grant it the necessary permissions, update the Cloud Run Service to use the SA, and then generate an access token to set as the authentication token within the requests; that token can be revoked later once you are done using it.

- Set the `SERVICE_ACCOUNT_NAME` environment variable for convenience:

  ```bash
  export SERVICE_ACCOUNT_NAME=tei-invoker
  ```

- Create the Service Account:

  ```bash
  gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME
  ```

- Grant the Service Account the Cloud Run Invoker role:

  ```bash
  gcloud run services add-iam-policy-binding $SERVICE_NAME \
      --member="serviceAccount:$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/run.invoker" \
      --region=$LOCATION
  ```

- Generate the Access Token for the Service Account:

  ```bash
  export ACCESS_TOKEN=$(gcloud auth print-access-token --impersonate-service-account=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com)
  ```

> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create an organization policy and use the `--lifetime` argument when creating the token. Refer to Access token lifetime to learn more. Otherwise, you can also generate a new token by running the same command again.

Now you can dive into the different alternatives for sending requests to the deployed Cloud Run Service using the `SERVICE_URL` and `ACCESS_TOKEN` as described above.

#### cURL with Service URL

To send a POST request to the TEI service using cURL, you can run the following command:

```bash
curl $SERVICE_URL/v1/embeddings \
    -X POST \
    -H "Authorization: Bearer $ACCESS_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tei",
        "input": "What is deep learning?"
    }'
```

#### Python with Service URL

To run inference using Python, you can either use the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/en/index) Python SDK (recommended) or the openai Python SDK.

##### huggingface_hub (Cloud Run Service URL)

You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url=os.getenv("SERVICE_URL"),
    api_key=os.getenv("ACCESS_TOKEN"),
)

embedding = client.feature_extraction("What is deep learning?")
print(len(embedding[0]))
```

##### OpenAI with Service URL

You can install it via pip as `pip install --upgrade openai`, and then run:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.getenv('SERVICE_URL')}/v1",
    api_key=os.getenv("ACCESS_TOKEN"),
)

response = client.embeddings.create(
    model="tei",
    input="What is deep learning?"
)

print(response)
```

## Resource clean up

Finally, once you are done using TEI on the Cloud Run Service, you can safely delete it to avoid incurring unnecessary costs, e.g. if the Cloud Run services are inadvertently invoked more times than your monthly Cloud Run invoke allocation in the free tier.
To delete the Cloud Run Service, you can either go to the Google Cloud Console at https://console.cloud.google.com/run and delete it manually, or use the Google Cloud SDK via gcloud as follows:

```bash
gcloud run services delete $SERVICE_NAME --region $LOCATION
```

Additionally, if you followed the steps in the Via Cloud Run Service URL section and generated a Service Account and an access token, you can either remove the Service Account, or just revoke the access token if it is still valid.

- (recommended) Revoke the Access Token as:

  ```bash
  gcloud auth revoke --impersonate-service-account=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com
  ```

- (optional) Delete the Service Account as:

  ```bash
  gcloud iam service-accounts delete $SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com
  ```

Finally, if you decided to enable the VPC network via Cloud NAT, you can also remove the Cloud NAT (which is a paid product) as:

```bash
gcloud compute routers nats delete vm-nat --router=nat-router --region=$LOCATION
gcloud compute routers delete nat-router --region=$LOCATION
```

## References

- [Cloud Run documentation - Overview](https://cloud.google.com/run/docs)
- [Cloud Run documentation - GPU services](https://cloud.google.com/run/docs/configuring/services/gpu)
- [Google Cloud blog - Run your AI inference applications on Cloud Run with NVIDIA GPUs](https://cloud.google.com/blog/products/application-development/run-your-ai-inference-applications-on-cloud-run-with-nvidia-gpus)