# Baseten

> ## Documentation Index

---

# Source: https://docs.baseten.co/examples/models/mars/MARS6.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# MARS6

> MARS6 is a frontier text-to-speech model by CAMB.AI with voice/prosody cloning capabilities in 10 languages. MARS6 must be licensed for commercial use; we can help!

## Example usage

This model requires at least four inputs:

1. `text`: The input text that needs to be spoken
2. `audio_ref`: An audio file containing the audio of a single person
3. `ref_text`: What is spoken in audio\_ref
4. `language`: The language code for the target language

The model will try to output an audio stream containing the speech in the reference audio's style. By default, the output is an HTTP/1.1 chunked-encoding response carrying an ADTS AAC audio stream. It can also be configured to stream FLAC, or to skip streaming entirely and return the whole response as a base64-encoded FLAC file.

```
data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": 'en-us',  # Target language, in this case English.
    # "top_p": 0.7,  # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7,  # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200,  # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0,  # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5  # Optional rep penalty, default 1.5
}
```

## Input

```python theme={"system"}
import base64
import time
import torchaudio
import requests
import IPython.display as ipd
import librosa, librosa.display
import torch
import io
from torchaudio.io import StreamReader

# Step 1: set endpoint url and api key:
url = ""
headers = {"Authorization": "Api-Key "}

# Step 2: pick reference audio to clone, encode it as base64
file_path = "ref_debug.flac"  # any valid audio filepath, ideally between 6s-90s.
wav, sr = librosa.load(file_path, sr=None, mono=True, offset=0, duration=5)
io_data = io.BytesIO()
torchaudio.save(io_data, torch.from_numpy(wav)[None], sample_rate=sr, format="wav")
io_data.seek(0)
encoded_data = base64.b64encode(io_data.read())
encoded_str = encoded_data.decode("utf-8")

# OPTIONAL: specify the transcript of the reference/prompt
# (slightly speeds up inference, and may make it sound a bit better).
prompt_txt = None  # if unspecified, can be left as None

# Step 3: define other inference settings:
data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": "en-us",  # Target language, in this case English.
    # "top_p": 0.7,  # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7,  # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200,  # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0,  # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5,  # Optional rep penalty, default 1.5
    # stream: bool = True  # whether to stream the response back as an HTTP1.1 chunked encoding response, or run to completion and return the base64 encoded file.
    # stream_format: str = "adts"  # 'adts' or 'flac' for stream format. Default 'adts'
}

st = time.time()


class UnseekableWrapper:
    def __init__(self, obj):
        self.obj = obj

    def read(self, n):
        return self.obj.read(n)


# Step 4: Send the POST request
# (note the first request might be a bit slow, but following requests should be fast)
response = requests.post(url, headers=headers, json=data, stream=True, timeout=300)
streamer = StreamReader(UnseekableWrapper(response.raw))
streamer.add_basic_audio_stream(
    11025, buffer_chunk_size=3, sample_rate=44100, num_channels=1
)

# Step 4.1: check the header format of the returned stream response
for i in range(streamer.num_src_streams):
    print(streamer.get_src_stream_info(i))

# Step 5: stream the response back and decode it on-the-fly
audio_samples = []
for chunks in streamer.stream():
    audio_chunk = chunks[0]
    audio_samples.append(
        audio_chunk._elem.squeeze()
    )  # this is now just a (T,) float waveform, however you can set your own output format above.
    print(
        f"Playing audio chunk of size {audio_chunk._elem.squeeze().shape} at {time.time() - st:.2f}s."
    )
    # If you wish, you can also play each chunk as you receive it, e.g. using IPython:
    # ipd.display(ipd.Audio(audio_chunk._elem.squeeze().numpy(), rate=44100, autoplay=True))

# Step 6: concatenate all the audio chunks and play the full audio
# (if you didn't play them on the fly above)
final_full_audio = torch.concat(audio_samples, dim=0)  # (T,) float waveform @ 44.1kHz
# ipd.display(ipd.Audio(final_full_audio.numpy(), rate=44100))
```

## Output

```json theme={"system"}
{
  "result": "base64 encoded audio data"
}
```

---

# Source: https://docs.baseten.co/organization/access.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Access control

> Manage access to your Baseten organization with role-based access control.

Baseten uses role-based access control (RBAC) to manage organization access. Every organization member has one of two roles.

| Permission               | Admin | Member |
| :----------------------- | ----- | ------ |
| Manage members           | ✅     | ❌      |
| Manage billing           | ✅     | ❌      |
| Deploy models and Chains | ✅     | ✅      |
| Call models              | ✅     | ✅      |

**Admins** have full control over the organization, including member management and billing.

**Members** can deploy and call models but cannot manage organization settings or other users.

If your organization uses multiple teams, see [Teams](/organization/teams) for information about team-level roles and permissions.

---

# Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-deployment-associated-with-an-environment.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Activate environment deployment

> Activates an inactive deployment associated with an environment and returns the activation status.
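A deactivated deployment in an environment can be reactivated with a single authenticated request to this endpoint. Below is a minimal sketch using the Python `requests` library; the model ID and environment name are placeholders you would replace with your own values.

```python theme={"system"}
import os
import requests

model_id = "YOUR_MODEL_ID"  # placeholder
env_name = "production"     # placeholder environment name

resp = requests.post(
    f"https://api.baseten.co/v1/models/{model_id}/environments/{env_name}/activate",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
)
print(resp.json())  # e.g. {"success": true}
```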
## OpenAPI ````yaml post /v1/models/{model_id}/environments/{env_name}/activate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments/{env_name}/activate: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/env_name' post: summary: Activates a deployment associated with an environment description: >- Activates an inactive deployment associated with an environment and returns the activation status. responses: '200': description: The response to a request to activate a deployment. content: application/json: schema: $ref: '#/components/schemas/ActivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true env_name: schema: type: string name: env_name in: path required: true schemas: ActivateResponseV1: description: The response to a request to activate a deployment. properties: success: default: true description: Whether the deployment was successfully activated title: Success type: boolean title: ActivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Any deployment by ID > Activates an inactive deployment and returns the activation status. ## OpenAPI ````yaml post /v1/models/{model_id}/deployments/{deployment_id}/activate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/{deployment_id}/activate: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/deployment_id' post: summary: Activates a deployment description: Activates an inactive deployment and returns the activation status. responses: '200': description: The response to a request to activate a deployment. content: application/json: schema: $ref: '#/components/schemas/ActivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true deployment_id: schema: type: string name: deployment_id in: path required: true schemas: ActivateResponseV1: description: The response to a request to activate a deployment. properties: success: default: true description: Whether the deployment was successfully activated title: Success type: boolean title: ActivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-development-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Development deployment > Activates an inactive development deployment and returns the activation status. 
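As with the environment-scoped endpoint above, a development deployment can be reactivated directly over HTTP. A minimal sketch with the Python `requests` library (the model ID is a placeholder):

```python theme={"system"}
import os
import requests

model_id = "YOUR_MODEL_ID"  # placeholder

resp = requests.post(
    f"https://api.baseten.co/v1/models/{model_id}/deployments/development/activate",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
)
print(resp.json())  # e.g. {"success": true}
```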
## OpenAPI ````yaml post /v1/models/{model_id}/deployments/development/activate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/development/activate: parameters: - $ref: '#/components/parameters/model_id' post: summary: Activates a development deployment description: >- Activates an inactive development deployment and returns the activation status. responses: '200': description: The response to a request to activate a deployment. content: application/json: schema: $ref: '#/components/schemas/ActivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: ActivateResponseV1: description: The response to a request to activate a deployment. properties: success: default: true description: Whether the deployment was successfully activated title: Success type: boolean title: ActivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/engines/bis-llm/advanced-features.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Gated features for BIS-LLM > KV-aware routing, disaggregated serving, and other gated features BIS-LLM provides features for large-scale deployments: KV cache optimization, disaggregated serving, and specialized inference strategies. These advanced features are not fully self-serviceable. [Contact us](mailto:support@baseten.co) to enable them for your organization. ## Available advanced features ### Routing and scaling *KV-aware routing* and *disaggregated serving* optimize multi-replica deployments. KV-aware routing directs requests to replicas with the best cache hit potential, while disaggregated serving separates prefill and decode phases into independent clusters that scale separately. *Separate prefill and decode autoscaling* uses token-exact metrics to right-size each phase. ### MoE optimization *WideEP* (expert parallelism) distributes experts across multiple GPUs for extremely large expert counts. These features work together to maximize hardware utilization on models like DeepSeek-V3 and Qwen3MoE. ### Attention and memory *DP attention for MLA* (Multi-Head Latent Attention) compresses the KV cache by projecting attention tensors into a compact latent space, while *DP attention* helps manage the KV cache across GPU ranks and tunes DeepSeek deployments for high throughput. *DeepSparseAttention* sparsifies the attention matrix based on token relevance. *Distributed KV storage* spreads KV cache across devices for long-context inference beyond single-device memory limits. ### Speculative decoding *Speculative n-gram automata-based decoding* uses automata to predict tokens from n-gram patterns without full model computation. *Speculative MTP or Eagle3 decoding* uses draft-model approaches to predict and verify multiple future tokens. ### Kernel optimization *Zero-overlap scheduling* overlaps computation and communication to hide latency. *Auto-tuned kernels* optimize kernel parameters for your specific hardware and model topology.
## KV-aware routing KV-aware routing directs requests to replicas with the best chance of KV cache hits, based on cache availability and replica utilization. KV-aware routing reduces inter-token latency by distributing load across replicas, improves time-to-first-token through cache hits on repeated queries, and increases global throughput through cache reuse. ## Disaggregated serving Disaggregated serving separates prefill and decode phases into independent clusters, allowing each to scale and be optimized independently. This architecture is particularly valuable for large MoE models. Disaggregated serving is available as a gated feature. [Contact us](mailto:support@baseten.co) to be paired with an engineer to discuss your needs. Disaggregated serving enables independent scaling of prefill and decode resources, isolates time-critical TTFT metrics from throughput-focused phases, and optimizes costs by right-sizing each phase for its workload. ## Get started ### Choose the right configuration **For advanced deployments** with large MoE models and planet-scale inference, [contact us](mailto:support@baseten.co). **For standard deployments**: Use the standard BIS-LLM configuration as documented in [BIS-LLM configuration](/engines/bis-llm/bis-llm-config). ## Model recommendations ### Models that benefit from advanced features **Large MoE models:** * DeepSeek-V3 * Qwen3MoE * Kimi-K2 * GLM-4.7 * GPT-OSS **Ideal use cases:** * High-throughput API services * Complex reasoning tasks * Long-context applications, including agentic coding * Planet-scale deployments ### When to use standard BIS-LLM or Engine-Builder-LLM * Dense models under 70B parameters * Standard MoE models under 30B parameters * Development and testing environments * Workloads with low KV cache hit rates ## Further reading * [BIS-LLM overview](/engines/bis-llm/overview): Main engine documentation. * [BIS-LLM reference config](/engines/bis-llm/bis-llm-config): Configuration options. * [Structured outputs documentation](/engines/performance-concepts/structured-outputs): JSON schema validation. * [Examples section](/examples/overview): Deployment examples. --- # Source: https://docs.baseten.co/examples/models/microsoft/all-mpnet-base-v2.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # All MPNet Base V2 > A text embedding model with a context window of 384 tokens and a dimensionality of 768 values. ## Example usage This model takes a list of strings and returns a list of embeddings, where each embedding is a list of 768 floating-point numbers representing the semantic text embedding of the associated string. Strings can be up to 384 tokens in length (approximately 280 words). If the strings are longer, they'll be truncated before being run through the embedding model.
```python theme={"system"} import requests import os # Replace the empty string with your model id below model_id = "" baseten_api_key = os.environ["BASETEN_API_KEY"] data = { "text": ["I want to eat pasta", "I want to eat pizza"], } # Call model endpoint res = requests.post( f"https://model-{model_id}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json=data ) # Print the output of the model print(res.json()) ``` ## JSON output ```json theme={"system"} [ [0.2593194842338562, "...", -1.4059709310531616], [0.11028853803873062, "...", -0.9492666125297546] ] ``` --- # Source: https://docs.baseten.co/organization/api-keys.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # API keys > Authenticate requests to Baseten for deployment, inference, and management. API keys authenticate your requests to Baseten. You need an API key to: * Deploy models, Chains, and training projects with the Truss CLI. * Call model endpoints for inference. * Use the management API. ## API key types Baseten supports two types of API keys: **Personal API keys** are tied to your user account. Actions performed with a personal key are attributed to you. Use personal keys for local development and testing. **Team API keys** are not tied to an individual user. When your organization has [teams](/organization/teams) enabled, team keys can be scoped to a specific team. Team keys can have different permission levels: * **Full access** - Deploy models, call endpoints, and manage resources. * **Inference only** - Call model endpoints but cannot deploy or manage. * **Metrics only** - Export metrics but cannot deploy or call models. Use team keys for CI/CD pipelines, production applications, and shared automation. If your organization uses [teams](/organization/teams), Team Admins can create team API keys scoped to their team. See [Teams](/organization/teams) for more information. ## Create an API key To create an API key: 1. Navigate to [API keys](https://app.baseten.co/settings/api_keys) in your account settings. 2. Select **Create API key**. 3. Choose **Personal** or **Team** key type. 4. Enter a name for the key (lowercase letters, numbers, and hyphens only). 5. For team keys, select the permission level. 6. Select **Next**. Copy the key immediately, you won't be able to view it again. ## Use API keys with the CLI The first time you run `truss push`, the CLI prompts you for your API key and saves it to `~/.trussrc`: ``` $ truss push --watch 💻 Let's add a Baseten remote! 🤫 Quietly paste your API_KEY: 💾 Remote config `baseten` saved to `~/.trussrc`. ``` To manually configure or update your API key, edit `~/.trussrc`: ```sh theme={"system"} [baseten] remote_provider = baseten api_key = YOUR_API_KEY ``` ## Use API keys with endpoints To call model endpoints with your API key, see [Call your model](/inference/calling-your-model). ## Manage API keys The [API keys page](https://app.baseten.co/settings/api_keys) shows all your keys with their creation date and last used timestamp. Use this information to identify unused keys. To rename a key, select the pencil icon next to the key name. To rotate a key, create a new key, update your applications to use it, then revoke the old key. To revoke a key, select the trash icon next to the key. Revoked keys cannot be restored. 
You can also manage API keys programmatically with the [REST API](/reference/management-api/api-keys/creates-an-api-key). ### Security recommendations * Store API keys in environment variables or secret managers, not in code. * Never commit API keys to version control. * Use team keys with minimal permissions for production applications. * Rotate keys periodically and revoke unused keys. --- # Source: https://docs.baseten.co/inference/async.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async inference > Run asynchronous inference on deployed models Async inference is a *fire and forget* pattern for model requests. Instead of waiting for a response, you receive a request ID immediately while inference runs in the background. When complete, results are delivered to your webhook endpoint. Async requests work with any deployed model, no code changes are required. Requests can queue for up to 72 hours and run for up to 1 hour. Async inference is not compatible with streaming output. Use async inference for: * **Long-running tasks** that would otherwise hit request timeouts. * **Batch processing** where you don't need immediate responses. * **Priority queuing** to serve VIP customers faster. Baseten does not store model outputs. If webhook delivery fails after all retries, your data is lost. See [Webhook delivery](#webhook-delivery) for mitigation strategies. ## Quick start Create an HTTPS endpoint to receive results. Use [this Repl](https://replit.com/@baseten-team/Baseten-Async-Inference-Starter-Code) as a starting point, or deploy to any service that can receive POST requests. Call your model's `/async_predict` endpoint with your webhook URL: ```python theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" webhook_endpoint = "YOUR_WEBHOOK_ENDPOINT" baseten_api_key = os.environ["BASETEN_API_KEY"] # Call the async_predict endpoint of the production deployment resp = requests.post( f"https://model-{model_id}.api.baseten.co/production/async_predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": webhook_endpoint, # "priority": 0, # "max_time_in_queue_seconds": 600, }, ) print(resp.json()) ``` You'll receive a `request_id` immediately. When inference completes, Baseten sends a POST request to your webhook with the model output. See [Webhook payload](#webhook-payload) for the response format. **Chains** support async inference through `async_run_remote`. Inference requests to the entrypoint are queued, but internal Chainlet-to-Chainlet calls run synchronously. ## How async works Async inference decouples request submission from processing, letting you queue work without waiting for results. ### Request lifecycle When you submit an async request: 1. You call `/async_predict` and immediately receive a `request_id`. 2. Your request enters a queue managed by the Async Request Service. 3. A background worker picks up your request and calls your model's predict endpoint. 4. Your model runs inference and returns a response. 5. Baseten sends the response to your webhook URL using POST. The `max_time_in_queue_seconds` parameter controls how long a request waits before expiring. It defaults to 10 minutes but can extend to 72 hours. ### Autoscaling behavior The async queue is decoupled from model scaling. Requests queue successfully even when your model has zero replicas. 
When your model is scaled to zero: 1. Your request enters the queue while the model has no running replicas. 2. The queue processor attempts to call your model, triggering the autoscaler. 3. Your request waits while the model cold-starts. 4. Once the model is ready, inference runs and completes. 5. Baseten delivers the result to your webhook. If the model doesn't become ready within `max_time_in_queue_seconds`, the request expires with status `EXPIRED`. Set this parameter to account for your model's startup time. For models with long cold starts, consider keeping minimum replicas running using [autoscaling settings](/deployment/autoscaling). ### Async priority Async requests are subject to two levels of priority: how they compete with sync requests for model capacity, and how they're ordered relative to other async requests in the queue. #### Sync vs async concurrency Sync and async requests share your model's concurrency pool, controlled by `predict_concurrency` in your model configuration: ```yaml config.yaml theme={"system"} runtime: predict_concurrency: 10 ``` The `predict_concurrency` setting defines how many requests your model can process simultaneously per replica. When both sync and async requests are in flight, sync requests take priority. The queue processor monitors your model's capacity and backs off when it receives 429 responses, ensuring sync traffic isn't starved. For example, if your model has `predict_concurrency=10` and 8 sync requests are running, only 2 slots remain for async requests. The remaining async requests stay queued until capacity frees up. #### Async queue priority Within the async queue itself, you can control processing order using the `priority` parameter. This is useful for serving specific requests faster or ensuring critical batch jobs run before lower-priority work. ```python theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" webhook_endpoint = "YOUR_WEBHOOK_URL" baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.post( f"https://model-{model_id}.api.baseten.co/production/async_predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "webhook_endpoint": webhook_endpoint, "model_input": {"prompt": "hello world!"}, "priority": 0, }, ) print(resp.json()) ``` The `priority` parameter accepts values 0, 1, or 2. Lower values indicate higher priority: a request with `priority: 0` is processed before requests with `priority: 1` or `priority: 2`. If you don't specify a priority, requests default to priority 1. Use priority 0 sparingly for truly urgent requests. If all requests are marked priority 0, the prioritization has no effect. ## Webhooks Baseten delivers async results to your webhook endpoint when inference completes. ### Request format When inference completes, Baseten sends a POST request to your webhook with these headers and body: ```text theme={"system"} POST /your-webhook-path HTTP/2.0 Content-Type: application/json X-BASETEN-REQUEST-ID: 9876543210abcdef1234567890fedcba X-BASETEN-SIGNATURE: v1=abc123... ``` The `X-BASETEN-REQUEST-ID` header contains the request ID for correlating webhooks with your original requests. The `X-BASETEN-SIGNATURE` header is only included if a [webhook secret](#secure-webhooks) is configured. Webhook endpoints must use HTTPS (except `localhost` for development). Baseten supports HTTP/2 and HTTP/1.1 connections. 
```json theme={"system"} { "request_id": "9876543210abcdef1234567890fedcba", "model_id": "abc123", "deployment_id": "def456", "type": "async_request_completed", "time": "2024-04-30T01:01:08.883423Z", "data": { "output": "model response here" }, "errors": [] } ``` The body contains the `request_id` matching your original `/async_predict` response, along with `model_id` and `deployment_id` identifying which deployment ran the request. The `data` field contains your model output, or `null` if an error occurred. The `errors` array is empty on success, or contains error objects on failure. ### Webhook delivery If all delivery attempts fail, your model output is permanently lost. Baseten delivers webhooks on a best-effort basis with automatic retries: | Setting | Value | | --------------- | -------------------------- | | Total attempts | 3 (1 initial + 2 retries). | | Backoff | 1 second, then 4 seconds. | | Timeout | 10 seconds per attempt. | | Retryable codes | 500, 502, 503, 504. | **To prevent data loss:** 1. **Save outputs in your model.** Use the `postprocess()` function to write to cloud storage: ```python theme={"system"} import json import boto3 class Model: # ... def postprocess(self, model_output): s3 = boto3.client("s3") s3.put_object( Bucket="my-bucket", Key=f"outputs/{self.context.get('request_id')}.json", Body=json.dumps(model_output) ) return model_output ``` This will process your model output and save it to your desired location. The `postprocess` method runs after inference completes. Use `self.context.get('request_id')` to access the async request ID for correlating outputs with requests. 2. **Use a reliable endpoint.** Deploy your webhook to a highly available service like a cloud function or message queue. ### Secure webhooks Create a webhook secret in the [Secrets tab](https://app.baseten.co/settings/secrets) to verify requests are from Baseten. When configured, Baseten includes an `X-BASETEN-SIGNATURE` header: ```text theme={"system"} X-BASETEN-SIGNATURE: v1=abc123... ``` To validate, compute an HMAC-SHA256 of the request body using your secret and compare: ```python theme={"system"} import hashlib import hmac def verify_signature(body: bytes, signature: str, secret: str) -> bool: expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest() actual = signature.replace("v1=", "").split(",")[0] return hmac.compare_digest(expected, actual) ``` The function computes an HMAC-SHA256 hash of the raw request body using your webhook secret. It extracts the signature value after `v1=` and uses `compare_digest` for timing-safe comparison to prevent timing attacks. Rotate secrets periodically. During rotation, both old and new secrets remain valid for 24 hours. ## Manage requests You can check the status of async requests or cancel them while they're queued. ### Check request status To check the status of an async request, call the status endpoint with your request ID: ```python theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" request_id = "YOUR_REQUEST_ID" baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://model-{model_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` Status is available for 1 hour after completion. See the [status API reference](/reference/inference-api/status-endpoints/get-async-request-status) for details. 
| Status | Description | | ---------------- | ------------------------------------------------ | | `QUEUED` | Waiting in queue. | | `IN_PROGRESS` | Currently processing. | | `SUCCEEDED` | Completed successfully. | | `FAILED` | Failed after retries. | | `EXPIRED` | Exceeded `max_time_in_queue_seconds`. | | `CANCELED` | Canceled by user. | | `WEBHOOK_FAILED` | Inference succeeded but webhook delivery failed. | ### Cancel a request Only `QUEUED` requests can be canceled. To cancel a request, call the cancel endpoint with your request ID: ```python theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" request_id = "YOUR_REQUEST_ID" baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.delete( f"https://model-{model_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` For more information, see the [cancel async request API reference](/reference/inference-api/predict-endpoints/cancel-async-request). ## Error codes When inference fails, the webhook payload returns an `errors` array: ```json theme={"system"} { "errors": [{ "code": "MODEL_PREDICT_ERROR", "message": "Details here" }] } ``` | Code | HTTP | Description | Retried | | ----------------------- | ------- | ------------------------------- | ------- | | `MODEL_NOT_READY` | 400 | Model is loading or starting. | Yes | | `MODEL_DOES_NOT_EXIST` | 404 | Model or deployment not found. | No | | `MODEL_INVALID_INPUT` | 422 | Invalid input format. | No | | `MODEL_PREDICT_ERROR` | 500 | Exception in `model.predict()`. | Yes | | `MODEL_UNAVAILABLE` | 502/503 | Model crashed or scaling. | Yes | | `MODEL_PREDICT_TIMEOUT` | 504 | Inference exceeded timeout. | Yes | ### Inference retries When inference fails with a retryable error, Baseten automatically retries the request using exponential backoff. Configure this behavior with `inference_retry_config`: ```python theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" webhook_endpoint = "YOUR_WEBHOOK_URL" baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.post( f"https://model-{model_id}.api.baseten.co/production/async_predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": webhook_endpoint, "inference_retry_config": { "max_attempts": 3, "initial_delay_ms": 1000, "max_delay_ms": 5000 } }, ) print(resp.json()) ``` | Parameter | Range | Default | Description | | ------------------ | -------- | ------- | ------------------------------------------------ | | `max_attempts` | 1-10 | 3 | Total inference attempts including the original. | | `initial_delay_ms` | 0-10,000 | 1000 | Delay before the first retry (ms). | | `max_delay_ms` | 0-60,000 | 5000 | Maximum delay between retries (ms). | Retries use exponential backoff with a multiplier of 2. With the default configuration, delays progress as: 1s → 2s → 4s → 5s (capped at `max_delay_ms`). Only requests that fail with retryable error codes (500, 502, 503, 504) are retried. Non-retryable errors like invalid input (422) or model not found (404) fail immediately. Inference retries are distinct from [webhook delivery retries](#webhook-delivery). Inference retries happen when calling your model fails. Webhook retries happen when delivering results to your endpoint fails. ## Rate limits There are rate limits for the async predict endpoint and the status polling endpoint. If you exceed these limits, you will receive a 429 status code. 
| Endpoint | Limit | | -------------------------------------------- | ----------------------------------- | | Predict endpoint requests (`/async_predict`) | 12,000 requests/minute (org-level). | | Status polling | 20 requests/second. | | Cancel request | 20 requests/second. | Use webhooks instead of polling to avoid status endpoint limits. Contact [support@baseten.co](mailto:support@baseten.co) to request increases. ## Observability Async metrics are available on the [Metrics tab](/observability/metrics#async-queue-metrics) of your model dashboard: * **Inference latency/volume**: includes async requests. * **Time in async queue**: time spent in `QUEUED` state. * **Async queue size**: number of queued requests. ## Resources For more information and resources, see the following: Fork this Repl to quickly set up a webhook endpoint for testing async inference. Configure webhook secrets in your Baseten settings to secure webhook delivery. --- # Source: https://docs.baseten.co/engines/performance-concepts/autoscaling-engines.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Auto-Scaling Engines > Performant auto-scaling custom-tailored to embedding and generation models on Baseten Beyond the [Introduction to autoscaling](/deployment/autoscaling), some adjustments specialized to models using dynamic batching are helpful. Both BEI and Engine-Builder-LLM use **dynamic batching** to process multiple requests in parallel. This increase in throughput comes at the cost of increased p50 latency. Combining this feature with engine-specific autoscaling becomes a powerful tool for maintaining optimal performance across varying traffic patterns. ## BEI BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly. **Key recommendations:** * **Enable autoscaling** - BEI's millisecond-range inference and dynamic batching require autoscaling to handle variable traffic efficiently * **Target utilization: 25%** - Low target provides headroom for traffic spikes and accommodates dynamic batching behavior * **Concurrency: 96+ requests** - High concurrency allows maximum throughput. If unsure, start with 64 and 40% utilization and tune on live traffic. * **Minimum concurrency: ≥8** - Never set below 8 for optimal performance **Multi-payload routes** (`/rerank`, `/v1/embeddings`) can send multiple requests at once, challenging autoscaling based on concurrent requests. Use the [Performance client](/engines/performance-concepts/performance-client) for optimal scaling. ## Engine-Builder-LLM Engine-Builder-LLM uses dynamic batching to maximize throughput, similar to BEI, but doesn't face the multi-payload challenge that BEI does with `/rerank` and `/v1/embeddings` routes. **Key recommendations:** * **Target utilization: 40-50%** - Lower than default to accommodate dynamic batching and provide headroom * **Concurrency: 16-256 requests** - If unsure, start with 64 and 40% utilization and tune on live traffic. * **Batch cases** - Use the Performance client for batch processing * **Minimum concurrency: ≥8** - Never set below 8 for optimal performance * **Lookahead works slightly better with lower batch-size** - Tune the concurrency to the same value as or slightly below `max_batch_size`, so that lookahead is aware that it can perform optimizations.
This guidance is also partially helpful for any `engine-builder-llm` engine, even if you're not using lookahead. **Important**: Do not set concurrency above `max_batch_size` as it leads to on-replica queueing and negates the benefits of autoscaling. General advice: Tune the equilibrium based on your live traffic, cost, throughput, and latency targets. Your mean expected concurrency will be the concurrency\_target \* target\_utilization. Most engines provide only marginal throughput improvements when working on 256 requests at a time versus 128. Keeping a mean expected concurrency around 16-64 will allow for the best stability guarantees and proactive scaling decisions under variable traffic. ## Quick Reference | **Setting** | **BEI** | **Engine-Builder-LLM** | | ------------------ | ------------ | ---------------------- | | Target utilization | 25% | 40-50% | | Concurrency | 96+ (min ≥8) | 32-256 | | Batch size | Flexible | Flexible | ## Further reading * [BEI overview](/engines/bei/overview) - General BEI documentation * [BEI reference config](/engines/bei/bei-reference) - Complete configuration options * [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview) - Generation model details * [Embedding examples](/examples/bei) - Concrete deployment examples * [Performance client documentation](/engines/performance-concepts/performance-client) - Client usage with embeddings * [Quantization guide](/engines/performance-concepts/quantization-guide) - Hardware considerations * [Performance optimization](/development/model/performance-optimization) - General performance guidance --- # Source: https://docs.baseten.co/deployment/autoscaling.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Autoscaling > Autoscaling dynamically adjusts the number of active replicas to **handle variable traffic** while minimizing idle compute costs. ## Configuring autoscaling Autoscaling settings are **per deployment** and are inherited when promoting a model to production unless overridden. Configure autoscaling through: * **UI** → Manage settings in your Baseten workspace. * **API** → Use the **[autoscaling API](/reference/management-api/deployments/autoscaling)**. ### Replica scaling Each deployment scales within a configured range of replicas: * **Minimum replicas** → The lowest number of active replicas. * Default: `0` (scale to zero). * Maximum value: Cannot exceed the **maximum replica count**. * **Maximum replicas** → The upper limit of active replicas. * Default: `1`. * Max: `10` by default (contact support to increase). When first deployed, the model starts with `1` replica (or the **minimum count**, if higher). As traffic increases, additional replicas **scale up** until the **maximum count** is reached. When traffic decreases, replicas **scale down** to match demand. *** ## Autoscaler settings The **autoscaler logic** is controlled by four key parameters: * **Autoscaling window** → Time window for traffic analysis before scaling up/down. Default: 60 seconds. * **Scale down delay** → Time before an unused replica is removed. Default: 900 seconds (15 minutes). * **Concurrency target** → Number of requests a replica should handle before scaling. Default: 1 request. * **Target Utilization Percentage** → Target percentage of filled concurrency slots. Default: 70%.
A **short autoscaling window** with a **longer scale-down delay** is recommended for **fast upscaling** while maintaining capacity during temporary dips. The **target utilization percentage** determines the amount of headroom available. A higher number means less headroom and more usage on each replica, where a lower number means more headroom and buffer for traffic spikes. *** ## Autoscaling behavior ### Scaling up When the **average requests per active replica** exceed the **concurrency target** within the **autoscaling window**, more replicas are created until: * The **concurrency target is met**, or * The **maximum replica count** is reached. Note here that the amount of headroom is determined by the **target utilization percentage**. For example, with a concurrency target of 10 requests and a target utilization percentage of 70%, scaling will begin when the average requests per active replica exceeds 7. ### Scaling down When traffic drops below the **concurrency target**, excess replicas are flagged for removal. The **scale-down delay** ensures that replicas are not removed prematurely: * If traffic **spikes again before the delay ends**, replicas remain active. * If the **minimum replica count** is reached, no further scaling down occurs. *** ## Scale to zero If you're just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money. Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests. To turn on scale to zero, just set a deployment's minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config. Models that have not received any traffic for more than **two weeks** will be automatically deactivated. These models will need to be activated manually before they can serve requests again. For **production deployments this threshold is two months**. *** ## Cold starts A **cold start** is the time required to **initialize a new replica** when scaling up. Cold starts impact: * **Scaled-to-zero deployments** → The first request must wait for a new replica to start. * **Scaling events** → When traffic spikes and a deployment requires more replicas. ### Cold start optimizations **Network accelerator** Baseten speeds up model loading from **Hugging Face, CloudFront, S3, and OpenAI** using parallelized **byte-range downloads**, reducing cold start delays. **Cold start pods** Baseten pre-warms specialized **cold start pods** to accelerate loading times. These pods appear in logs as `[Coldboost]`. ```md Example coldboost log line theme={"system"} Oct 09 9:20:25pm [Coldboost] Completed model.load() execution in 12650 ms ``` **Model Image streaming and optimization** To further reduce initialization latency, Baseten uses **image streaming** to optimize container startup. 1. **Initial non-optimized image:** When a model is first deployed, a standard image is built without optimization. During this stage, the runtime monitors which parts of the image are accessed during startup and inference. 2. **Call graph–based optimization:** Baseten analyzes the model’s call graph to identify which layers, weights, and binaries are actually needed during initialization. This information drives creation of an **optimized image**. 3. 
**Prefetch and lazy fetch:** The optimized image is split into two content groups: * **Prefetched content:** Frequently accessed layers and dependencies are loaded eagerly at container start. * **Lazy-fetched content:** Less critical data is fetched on demand, reducing initial I/O overhead. 4. **Streaming-enabled image pull:** Images optimized through this process are streamed into the node filesystem during startup, allowing the model to begin loading before the entire image is downloaded. Pulling an optimized image appears in logs as: ```md Example streaming image pull log line theme={"system"} Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB. ``` *** ## Autoscaling for development deployments Development deployments have **fixed autoscaling constraints** to optimize for **live reload workflows**: * **Min replicas:** `0` * **Max replicas:** `1` * **Autoscaling window:** `60 seconds` * **Scale down delay:** `900 seconds (15 min)` * **Concurrency target:** `1 request` To enable full autoscaling, **promote the deployment to an environment** such as production. --- # Source: https://docs.baseten.co/development/model/b10cache.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # b10cache 🆕 > Persist data across replicas or deployments ### Early Access Please contact our [support team](mailto:support@baseten.co) for access to b10cache. Deployments sometimes produce cache or other files that are useful to other replicas. For example, using `torch.compile` produces a cache that speeds up future `torch.compile` calls on the same function, which can reduce other replicas' cold start times. **These files can be stored via b10cache**. b10cache is a volume mounted over the network onto each of your pods. There are two ways files can be stored: #### 1. `/cache/org/` This directory is shared, and can be written to or accessed by every pod you deploy. Simply move a file into here and it will be accessible. #### 2. `/cache/model/` This directory is shared by every pod within the scope of your deployment. This is excellent for keeping filesystems clean and limiting access. ### Not persistent object storage While b10cache is very reliable, it should not be used as persistent object storage or a database. **It should be considered a cache** that can be shared by deployments, meaning there should always be a fallback plan if the b10cache path does not exist. See two features built on b10cache: 1. [*model cache*](/development/model/model-cache) 2. [*torch compile cache*](/development/model/torch-compile-cache) --- # Source: https://docs.baseten.co/development/model/base-images.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Base Docker images > A guide to configuring a base image for your truss Truss uses containerized environments to ensure consistent model execution across deployments. While the default Truss image works for most cases, you may need a custom base image to meet specific package or system requirements. ## Setting a base image in `config.yaml` Specify a custom base image in `config.yaml`: ```yaml config.yaml theme={"system"} base_image: image: python_executable_path: ``` * `image`: The Docker image to use. * `python_executable_path`: The path to the Python binary inside the container.
### Example: NVIDIA NeMo Model Using a custom image to deploy [NVIDIA NeMo TitaNet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large) model: ```yaml config.yaml theme={"system"} base_image: image: nvcr.io/nvidia/nemo:23.03 python_executable_path: /usr/bin/python apply_library_patches: true requirements: - PySoundFile resources: accelerator: T4 cpu: 2500m memory: 4512Mi use_gpu: true secrets: {} system_packages: - python3.8-venv ``` ## Using Private Base Images If your base image is private, ensure that you have configured your model to use a [private registry](/development/model/private-registries) ## Creating a custom base image You can build a new base image using Truss’s base images as a foundation. Available images are listed on [Docker Hub](https://hub.docker.com/r/baseten/truss-server-base/tags). #### Example: Customizing a Truss Base Image ```Dockerfile Dockerfile theme={"system"} FROM baseten/truss-server-base:3.11-gpu-v0.7.16 RUN pip uninstall cython -y RUN pip install cython==0.29.30 ``` #### Building & Pushing Your Custom Image Ensure Docker is installed and running. Then, build, tag, and push your image: ```sh theme={"system"} docker build -t my-custom-base-image:0.1 . docker tag my-custom-base-image:0.1 your-docker-username/my-custom-base-image:0.1 docker push your-docker-username/my-custom-base-image:0.1 ``` --- # Source: https://docs.baseten.co/training/concepts/basics.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Basics > Learn how to get up and running on Baseten Training This page covers the essential building blocks of Baseten Training. These are the core concepts you'll need to understand to effectively organize and execute your training workflows. ## How Baseten Training works Baseten Training jobs can be launched from any terminal. Training jobs are created from within a directory, and when created, that directory is packaged up and can be pushed up to Baseten. This allows you to define your Baseten training config, scripts, code, and any other dependencies within the folder. Within the folder, we require you to include a Baseten training config file such as `config.py`. The `config.py` includes a list of `run_commands`, which can be anything from running a Python file (`python train.py`) to a bash script (`chmod +x run.sh && ./run.sh`). If you're looking to upload more than 1GB of files, we strongly suggest uploading your data to an object store and including a download command before running your training code. To avoid duplicate downloads, check out our documentation on the [cache](/training/concepts/cache). ## Setting up your workspace If you'd like to start from one of our existing recipes, you can check out one of the following examples: **Simple CPU job with raw PyTorch:** ```bash theme={"system"} truss train init --examples mnist-pytorch ``` **More complex example that trains GPT-OSS-20b:** ```bash theme={"system"} truss train init --examples oss-gpt-20b-axolotl ``` Your `config.py` contains all infrastructure configuration for your job, which we will cover below. Your `run.sh` is invoked by the command that runs when the job first begins. Here you can install any Python dependencies not already included in your Docker image, and begin the execution of your code either by calling a Python file with your training code or a launch command. 
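To make the concepts covered below more concrete, here is a rough sketch of what a `config.py` can look like. It is an illustrative outline only: the exact class and field names (`TrainingJob`, `Compute`, `Image`, `run_commands`, and the placeholder values) are assumptions based on the descriptions on this page, so start from a `truss train init --examples` recipe and adapt it rather than copying this verbatim.

```python theme={"system"}
# config.py -- illustrative sketch; class and field names are assumptions, see the example recipes.
from truss_train import definitions

training_job = definitions.TrainingJob(
    image=definitions.Image(base_image="..."),  # a Baseten-provided or custom base image
    compute=definitions.Compute(node_count=1),  # GPU type, CPU, and memory are configured here too
    runtime=definitions.Runtime(
        run_commands=["chmod +x run.sh && ./run.sh"],  # anything from `python train.py` to a bash script
        environment_variables={
            "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
        },
    ),
)

training_project = definitions.TrainingProject(name="my-first-project", job=training_job)
```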
## Organizing your work with `TrainingProject`s A `TrainingProject` is a lightweight organization tool to help you group different `TrainingJob`s together. While there are a few technical details to consider, your team can use `TrainingProject`s to facilitate collaboration and organization. ## Running a `TrainingJob` Once you have a `TrainingProject`, the actual work of training a model happens within a **`TrainingJob`**. Each `TrainingJob` represents a single, complete execution of your training script with a specific configuration. * **What it is:** A `TrainingJob` is the fundamental unit of execution. It bundles together: * Your training code. * A base `image`. * The `compute` resources needed to run the job. * The `runtime` configurations like startup commands and environment variables. * **Why use it:** Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new `TrainingJob`s while knowing that previous ones have been persisted on Baseten. * **Lifecycle:** A job goes through various stages, from being created (`TRAINING_JOB_CREATED`), to resources being set up (`TRAINING_JOB_DEPLOYING`), to actively running your script (`TRAINING_JOB_RUNNING`), and finally to a terminal state like `TRAINING_JOB_COMPLETED`. More details on the job lifecycle can be found on the [Lifecycle](/training/lifecycle) page. ## Compute resources The `Compute` configuration defines the computational resources your training job will use. This includes: * **GPU specifications** - Choose from various GPU types based on your model's requirements * **CPU and memory** - Configure the amount of CPU and RAM allocated to your job * **Node count** - For single-node or multi-node training setups Baseten Training supports H100, H200, and A10G GPUs. Choose your GPU type based on your model's memory requirements and performance needs. ## Base images Baseten provides pre-configured base images that include common ML frameworks and dependencies. These images are optimized for training workloads and include: * Popular ML frameworks (PyTorch, VERL, Megatron, Axolotl, etc.) * GPU drivers and CUDA support * Common data science libraries You can also use [custom or private images](/development/model/private-registries) if you have specific requirements. ## Securely integrate with external services with `SecretReference` Successfully training a model often requires many tools and services. Baseten provides **`SecretReference`** for secure handling of secrets. * **How to use it:** Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace with a specific name. In your job's configuration (e.g., environment variables), you refer to this secret by its name using `SecretReference`. The actual secret value is never exposed in your code. * **How it works:** Baseten injects the secret value at runtime under the environment variable name that you specify. ```python theme={"system"} from truss_train import definitions runtime = definitions.Runtime( # ... other runtime options environment_variables={ "HF_TOKEN": definitions.SecretReference(name="hf_access_token"), }, ) ``` ## Running inference on trained models The journey from training to a usable model in Baseten typically follows this path: 1. A `TrainingJob` with checkpointing enabled produces one or more model artifacts. 2. You run `truss train deploy_checkpoint` to deploy a model from your most recent training job.
You can read more about this at [Serving Trained Models](/training/deployment). 3. Once deployed, your model will be available for inference via API. See more at [Calling Your Model](/inference/calling-your-model). ## Next steps: advanced topics Now that you understand the basics of Baseten Training, explore these advanced topics to optimize your training workflows: * **[Cache](/training/concepts/cache)** - Speed up your training iterations by persisting data between jobs and avoiding expensive downloads * **[Checkpointing](/training/concepts/checkpointing)** - Manage model checkpoints seamlessly and avoid disk errors during training * **[Multinode Training](/training/concepts/multinode)** - Scale your training across multiple nodes with high-speed InfiniBand networking --- # Source: https://docs.baseten.co/engines/bei/bei-bert.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # BEI-Bert > BERT-optimized embeddings with cold-start performance BEI-Bert is a specialized variant of Baseten Embeddings Inference optimized for BERT-based model architectures. It provides superior cold-start performance and 16-bit precision for models that benefit from bidirectional attention patterns. ## When to use BEI-Bert ### Ideal use cases **Model architectures:** * **Sentence-transformers**: `sentence-transformers/all-MiniLM-L6-v2` * **Jina models**: `jinaai/jina-embeddings-v2-base-en`, `jinaai/jina-embeddings-v2-base-code` * **Nomic models**: `nomic-ai/nomic-embed-text-v1.5`, `nomic-ai/nomic-embed-code-v1.5` * **BERT variants**: `FacebookAI/roberta-base`, `cardiffnlp/twitter-roberta-base` * **Gemma3Bidirectional**: `google/embeddinggemma-300m` * **ModernBERT**: `answerdotai/ModernBERT-base` * **Qwen2Bidirectional**: `Alibaba-NLP/gte-Qwen2-7B-instruct` * **Qwen3Bidirectional**: `voyageai/voyage-4-nano` * **Llama3Bidirectional**: `nvidia/llama-embed-nemotron-8b` **Deployment scenarios:** * **Cold-start sensitive applications**: Where first-request latency is critical * **Small to medium models** (under 4B parameters): Where quantization isn't needed * **High-accuracy requirements**: Where 16-bit precision is preferred * **Bidirectional attention**: Models with bidirectional attention run best on this engine. ### BEI-Bert vs BEI comparison | Feature | BEI-Bert | BEI | | ------------ | ------------------------------------ | --------------------------------- | | Architecture | BERT-based (bidirectional) | Causal (unidirectional) | | Precision | FP16 (16-bit) | BF16/FP16/FP8/FP4 (quantized) | | Cold-start | Optimized for fast initialization | Standard startup | | Quantization | Not supported | FP8/FP4 supported | | Memory usage | Lower for small models | Higher or equal | | Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec | | Best for | Small BERT models, accuracy-critical | Large models, throughput-critical | ## Recommended models (MTEB ranking) ### Top-tier embeddings **High performance (rank 2-8):** * `Alibaba-NLP/gte-Qwen2-7B-instruct` (7.61B): Bidirectional. * `intfloat/multilingual-e5-large-instruct` (560M): Multilingual. * `google/embeddinggemma-300m` (308M): Google's compact model. **Mid-range performance (rank 15-35):** * `Alibaba-NLP/gte-Qwen2-1.5B-instruct` (1.78B): Cost-effective. * `Salesforce/SFR-Embedding-2_R` (7.11B): Salesforce model. * `Snowflake/snowflake-arctic-embed-l-v2.0` (568M): Snowflake large.
* `Snowflake/snowflake-arctic-embed-m-v2.0` (305M): Snowflake medium. **Efficient models (rank 52-103):** * `WhereIsAI/UAE-Large-V1` (335M): UAE large model. * `nomic-ai/nomic-embed-text-v1` (137M): Nomic original. * `nomic-ai/nomic-embed-text-v1.5` (137M): Nomic improved. * `sentence-transformers/all-mpnet-base-v2` (109M): MPNet base. **Specialized models:** * `nomic-ai/nomic-embed-text-v2-moe` (475M-A305M): Mixture of experts. * `Alibaba-NLP/gte-large-en-v1.5` (434M): Alibaba large English. * `answerdotai/ModernBERT-large` (396M): Modern BERT large. * `jinaai/jina-embeddings-v2-base-en` (137M): Jina English. * `jinaai/jina-embeddings-v2-base-code` (137M): Jina code. ### Re-ranking models **Top re-rankers:** * `BAAI/bge-reranker-large`: XLM-RoBERTa based. * `BAAI/bge-reranker-base`: XLM-RoBERTa base. * `Alibaba-NLP/gte-multilingual-reranker-base`: GTE multilingual. * `Alibaba-NLP/gte-reranker-modernbert-base`: ModernBERT reranker. ### Classification models **Sentiment analysis:** * `SamLowe/roberta-base-go_emotions`: RoBERTa for emotions. ## Supported model families ### Popular Hugging Face models Find supported models on Hugging Face: * [Embedding Models](https://huggingface.co/models?pipeline_tag=feature-extraction\&other=text-embeddings-inference\&sort=trending) * [Classification Models](https://huggingface.co/models?pipeline_tag=text-classification\&other=text-embeddings-inference\&sort=trending) ### Sentence-transformers The most common BERT-based embedding models, optimized for semantic similarity. **Popular models:** * `sentence-transformers/all-MiniLM-L6-v2` (384D, 22M params) * `sentence-transformers/all-mpnet-base-v2` (768D, 110M params) * `sentence-transformers/multi-qa-mpnet-base-dot-v1` (768D, 110M params) **Configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "sentence-transformers/all-MiniLM-L6-v2" quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings ``` ### Voyage and Nemotron Bidirectional LLMs Large-decoder architectures with bidirectional attention like Qwen3 (`voyageai/voyage-4-nano`) or Llama3 (`nvidia/llama-embed-nemotron-8b`) can be deployed with BEI-Bert. **Configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "voyageai/voyage-4-nano" # rewrite of the config files for compatibility (no custom code support) revision: "refs/pr/5" quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings ``` ### Jina AI embeddings Jina's BERT-based models optimized for various domains including code. **Popular models:** * `jinaai/jina-embeddings-v2-base-en` (512D, 137M params) * `jinaai/jina-embeddings-v2-base-code` (512D, 137M params) * `jinaai/jina-embeddings-v2-base-es` (512D, 137M params) **Configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "jinaai/jina-embeddings-v2-base-en" quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings ``` ### Nomic AI embeddings Nomic's models with specialized training for text and code.
**Popular models:** * `nomic-ai/nomic-embed-text-v1.5` (768D, 137M params) * `nomic-ai/nomic-embed-code-v1.5` (768D, 137M params) **Configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "nomic-ai/nomic-embed-text-v1.5" quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings ``` ### Alibaba GTE and Qwen models Advanced multilingual models with instruction-tuning and long-context support. **Popular models:** * `Alibaba-NLP/gte-Qwen2-7B-instruct`: Top-ranked multilingual. * `Alibaba-NLP/gte-Qwen2-1.5B-instruct`: Cost-effective alternative. * `intfloat/multilingual-e5-large-instruct`: E5 multilingual variant. **Configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "Alibaba-NLP/gte-Qwen2-7B-instruct" quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings ``` ## Configuration examples ### Cost-effective GTE-Qwen deployment ```yaml theme={"system"} model_name: BEI-Bert-GTE-Qwen-1.5B resources: accelerator: L4 cpu: '1' memory: 15Gi use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "Alibaba-NLP/gte-Qwen2-1.5B-instruct" revision: main max_num_tokens: 8192 quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.85 batch_scheduler_policy: guaranteed_no_evict ``` ### Basic sentence-transformer deployment ```yaml theme={"system"} model_name: BEI-Bert-MiniLM resources: accelerator: L4 cpu: '1' memory: 10Gi use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "sentence-transformers/all-MiniLM-L6-v2" revision: main max_num_tokens: 8192 quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ### Jina code embeddings deployment ```yaml theme={"system"} model_name: BEI-Bert-Jina-Code resources: accelerator: H100 cpu: '1' memory: 10Gi use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "jinaai/jina-embeddings-v2-base-code" revision: main max_num_tokens: 8192 quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ### Nomic text embeddings with custom routing ```yaml theme={"system"} model_name: BEI-Bert-Nomic-Text resources: accelerator: L4 cpu: '1' memory: 10Gi use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "nomic-ai/nomic-embed-text-v1.5" revision: main max_num_tokens: 16384 quantization_type: no_quant runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.85 batch_scheduler_policy: guaranteed_no_evict ``` ## Integration examples ### OpenAI client with Qwen3 instructions ```python theme={"system"} from openai import OpenAI import os client = OpenAI( api_key=os.environ['BASETEN_API_KEY'], base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1" ) response = client.embeddings.create( input="This is a test sentence for embedding.", model="not-required" ) # Batch embedding with multiple documents documents = [ "Product documentation for software library", "User question about API usage", "Code snippet example" ] response = client.embeddings.create( input=documents, model="not-required" ) print(f"Embedding dimension: {len(response.data[0].embedding)}") 
print(f"Processed {len(response.data)} embeddings") ``` ### Baseten Performance Client For maximum throughput with BEI-Bert: ```python theme={"system"} from baseten_performance_client import PerformanceClient client = PerformanceClient( api_key=os.environ['BASETEN_API_KEY'], base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync" ) # High-throughput batch processing texts = [f"Sentence {i}" for i in range(1000)] response = client.embed( input=texts, model="not-required", batch_size=8, max_concurrent_requests=16, timeout_s=300 ) print(f"Processed {len(response.numpy())} embeddings") print(f"Embedding shape: {response.numpy().shape}") ``` ### Direct API usage ```python theme={"system"} import requests import os import json headers = { "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}", "Content-Type": "application/json" } data = { "input": ["Text to embed", "Another text"], "encoding_format": "float" } response = requests.post( "https://model-xxxxxx.api.baseten.co/environments/production/sync/v1/embeddings", headers=headers, json=data ) result = response.json() print(f"Embeddings: {len(result['data'])} embeddings generated") ``` ## Best practices ### Model selection guide Choose based on your primary constraint: **Cost-effective (balanced performance/cost):** * `Alibaba-NLP/gte-Qwen2-7B-instruct`: Instruction-tuned, ranked #1 for multilingual. * `Alibaba-NLP/gte-Qwen2-1.5B-instruct`: 1/5 the size, still top-tier. * `Snowflake/snowflake-arctic-embed-m-v2.0`: Multilingual-optimized, MRL support. **Lightweight & fast (under 500M):** * `google/embeddinggemma-300m`: 300M params, 100+ languages. * `Snowflake/snowflake-arctic-embed-m-v2.0`: 305M, compression-friendly. * `nomic-ai/nomic-embed-text-v1.5`: 137M, minimal latency. * `sentence-transformers/all-MiniLM-L6-v2`: 22M, legacy standard. 
**Specialized:** * **Code:** `jinaai/jina-embeddings-v2-base-code` * **Long sequences:** `Alibaba-NLP/gte-large-en-v1.5` * **Re-ranking:** `BAAI/bge-reranker-large`, `Alibaba-NLP/gte-reranker-modernbert-base` ### Hardware optimization **Cost-effective deployments:** * L4 GPUs for models `<200M` parameters * H100 GPUs for models 200-500M parameters * Enable autoscaling for variable traffic **Performance optimization:** * Use `max_num_tokens: 8192` for most use cases * Use `max_num_tokens: 16384` for long documents * Tune `batch_scheduler_policy` based on traffic patterns ### Deployment strategies **For development:** * Start with smaller models (MiniLM) * Use L4 GPUs for cost efficiency * Enable detailed logging **For production:** * Use larger models (MPNet) for better quality * Use H100 GPUs for better performance * Implement monitoring and alerting **For edge deployments:** * Use smallest suitable models * Optimize for cold-start performance * Consider model size constraints ## Troubleshooting ### Common issues **Slow cold-start times:** * Ensure model is properly cached * Consider using smaller models * Check GPU memory availability **Lower than expected throughput:** * Verify `max_num_tokens` is appropriate * Check `batch_scheduler_policy` settings * Monitor GPU utilization **Memory issues:** * Reduce `max_num_tokens` if needed * Use smaller models for available memory * Monitor memory usage during deployment ### Performance tuning **For lower latency:** * Reduce `max_num_tokens` * Use `batch_scheduler_policy: guaranteed_no_evict` * Consider smaller models **For higher throughput:** * Increase `max_num_tokens` appropriately * Use `batch_scheduler_policy: max_utilization` * Optimize batch sizes in client code **For cost optimization:** * Use L4 GPUs when possible * Choose appropriately sized models * Implement efficient autoscaling ## Migration from other systems ### From sentence-transformers library **Python code:** ```python theme={"system"} # Before (sentence-transformers) from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') embeddings = model.encode(sentences) # After (BEI-Bert) from openai import OpenAI client = OpenAI(api_key=BASETEN_API_KEY, base_url=BASE_URL) embeddings = client.embeddings.create(input=sentences, model="not-required") ``` ### From other embedding services BEI-Bert provides OpenAI-compatible endpoints: 1. **Update base URL**: Point to Baseten deployment 2. **Update API key**: Use Baseten API key 3. **Test compatibility**: Verify embedding dimensions and quality 4. **Optimize**: Tune batch sizes and concurrency for performance ## Further reading * [BEI overview](/engines/bei/overview) - General BEI documentation * [BEI reference config](/engines/bei/bei-reference) - Complete configuration options * [Embedding examples](/examples/bei) - Concrete deployment examples * [Performance client documentation](/engines/performance-concepts/performance-client) - Client Usage with Embeddings * [Performance optimization](/development/model/performance-optimization) - General performance guidance --- # Source: https://docs.baseten.co/engines/bei/bei-reference.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Configuration reference > Complete reference config for BEI and BEI-Bert engines This reference covers all configuration options for BEI and BEI-Bert deployments. 
All settings use the `trt_llm` section in `config.yaml`. ## Configuration structure ```yaml theme={"system"} trt_llm: inference_stack: v1 # Always v1 for BEI build: base_model: encoder | encoder_bert checkpoint_repository: {...} max_num_tokens: 16384 quantization_type: no_quant | fp8 | fp4 | fp4_kv quantization_config: {...} plugin_configuration: {...} runtime: webserver_default_route: /v1/embeddings | /rerank | /predict kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict ``` ## Build configuration The `build` section configures model compilation and optimization settings. ### `base_model` The base model architecture determines which BEI variant to use. **Options:** * `encoder`: BEI - for causal embedding models (Llama, Mistral, Qwen, Gemma) * `encoder_bert`: BEI-Bert - for BERT-based models (BERT, RoBERTa, Jina, Nomic) ```yaml theme={"system"} build: base_model: encoder ``` ### `checkpoint_repository` Specifies where to find the model checkpoint. The repository must follow the standard HuggingFace structure. **Source options:** * `HF`: Hugging Face Hub (default) * `GCS`: Google Cloud Storage * `S3`: AWS S3 * `AZURE`: Azure Blob Storage * `REMOTE_URL`: HTTP URL to tar.gz file * `BASETEN_TRAINING`: Baseten Training checkpoints For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3). ```yaml theme={"system"} checkpoint_repository: source: HF repo: "BAAI/bge-large-en-v1.5" revision: main runtime_secret_name: hf_access_token # Optional, for private repos ``` ### `max_num_tokens` Maximum number of tokens that can be processed in a single batch. BEI and BEI-Bert run without chunked-prefill for performance reasons. This limits the effective context length to the `max_position_embeddings` value. **Range:** 64 to 131072, must be multiple of 64. Use higher values (up to 131072) for long context models. Most models use 16384 as default. ```yaml theme={"system"} build: max_num_tokens: 16384 ``` **Note:** A maximum sequence length setting is not supported for BEI engines; leave it unset. BEI sets it automatically and truncates if the context length is exceeded. ### `quantization_type` Specifies the quantization format for model weights. `FP8` quantization maintains accuracy within 1% of `FP16` for embedding models. **Options for BEI:** * `no_quant`: `FP16`/`BF16` precision * `fp8`: `FP8` weights + 16-bit KV cache * `fp4`: `FP4` weights + 16-bit KV cache (B200 only) * `fp4_mlp_only`: `FP4` MLP weights only (B200 only) **Options for BEI-Bert:** * `no_quant`: `FP16` precision (only option) For detailed quantization guidance, see [Quantization guide](/engines/performance-concepts/quantization-guide). ```yaml theme={"system"} build: quantization_type: fp8 ``` ### `quantization_config` Configuration for post-training quantization calibration. **Fields:** * `calib_size`: Size of calibration dataset (64-16384, multiple of 64) * `calib_dataset`: HuggingFace dataset for calibration * `calib_max_seq_length`: Maximum sequence length for calibration ```yaml theme={"system"} quantization_config: calib_size: 512 calib_dataset: "cnn_dailymail" calib_max_seq_length: 1024 ``` ### `plugin_configuration` BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported. **Automatic optimizations:** * XQA kernels for maximum throughput * Dynamic batching for optimal utilization * Memory-efficient attention mechanisms * Hardware-specific optimizations **Note:** Plugin configuration is only available for the Engine-Builder-LLM engine.
## Runtime configuration The `runtime` section configures serving behavior. ### `webserver_default_route` The default API endpoint for the deployment. **Options:** * `/v1/embeddings`: OpenAI-compatible embeddings endpoint * `/rerank`: Reranking endpoint * `/predict`: Classification/prediction endpoint BEI automatically detects embedding models and sets `/v1/embeddings`. Classification models default to `/predict`. ```yaml theme={"system"} runtime: webserver_default_route: /v1/embeddings ``` The remaining runtime fields only apply to generative models and are not applicable to BEI engines. ## HuggingFace Model Repository Structure All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, similar to running: ```bash theme={"system"} git clone https://huggingface.co/michaelfeil/bge-small-en-v1.5 ``` ### Model configuration **config.json** * `max_position_embeddings`: Limits maximum context size (content beyond this is truncated) * `id2label`: Required dictionary mapping IDs to labels for classification models. * **Note**: Its length must match the output dimension of the last dense layer. Each dense output needs a `name` for the JSON response. * `architecture`: Must be `ModelForSequenceClassification` or similar (cannot be `ForCausalLM`) * **Note**: Remote code execution is not supported; architecture is inferred automatically * `torch_dtype`: Default inference dtype (BEI-Bert: always `fp16`, BEI: `float16`, `bfloat16`) * **Note**: We don't support `pre-quantized` loading, meaning your weights need to be `float16`, `bfloat16` or `float32` for all engines. * `quant_config`: Not allowed, as no `pre-quantized` weights. #### Model weights **model.safetensors** (preferred) * Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded) * **Note**: Convert to safetensors if you encounter issues with other formats #### Tokenizer files **tokenizer\_config.json** and **tokenizer.json** * Must be "FAST" tokenizers compatible with Rust * Custom Python code in the tokenizer is typically not supported and will be ignored.
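As a quick local sanity check (this uses the `transformers` library and is independent of the BEI tooling; the repo name below is just an example), you can verify that a checkpoint ships a Rust-backed fast tokenizer before deploying it:

```python theme={"system"}
# Hypothetical pre-deployment check: confirm the repo provides a fast (Rust) tokenizer.
from transformers import AutoTokenizer

repo = "sentence-transformers/all-MiniLM-L6-v2"  # example repo; substitute your own
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)

# `is_fast` is True when a Rust-compatible tokenizer (tokenizer.json) is available.
print(f"{repo} provides a fast tokenizer: {tokenizer.is_fast}")
```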
#### Embedding model files (sentence-transformers) **1\_Pooling/config.json** * Required for embedding models to define pooling strategy **modules.json** * Required for embedding models * Shows available pooling layers and configurations ### Pooling layer support | **Engine** | **Classification Layers** | **Pooling Types** | **Notes** | | ------------ | -------------------------- | --------------------------------------------- | ------------------------ | | **BEI** | 1 layer maximum | Last token, first token | Limited pooling options | | **BEI-Bert** | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support | ## Complete configuration examples ### BEI with `FP8` quantization (embedding model) ```yaml theme={"system"} model_name: BEI-BGE-Large-FP8 resources: accelerator: H100 use_gpu: true trt_llm: build: base_model: encoder checkpoint_repository: source: HF repo: "Qwen/Qwen3-Embedding-8B" revision: main max_num_tokens: 16384 quantization_type: fp8 quantization_config: calib_size: 1536 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: false runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ### BEI-Bert for small BERT model ```yaml theme={"system"} model_name: BEI-Bert-MiniLM-L6 resources: accelerator: L4 use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "sentence-transformers/all-MiniLM-L6-v2" revision: main max_num_tokens: 8192 quantization_type: no_quant plugin_configuration: # Limited options for encoder models paged_kv_cache: false # Disabled for encoder_bert use_paged_context_fmha: false use_fp8_context_fmha: false runtime: webserver_default_route: /v1/embeddings kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ### BEI for reranking model ```yaml theme={"system"} model_name: BEI-BGE-Reranker resources: accelerator: H100 use_gpu: true trt_llm: build: base_model: encoder checkpoint_repository: source: HF repo: "BAAI/bge-reranker-large" revision: main max_num_tokens: 16384 quantization_type: fp8 quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 runtime: webserver_default_route: /rerank kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ### BEI-Bert for classification model ```yaml theme={"system"} model_name: BEI-Bert-Language-Detection resources: accelerator: L4 use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: HF repo: "papluca/xlm-roberta-base-language-detection" revision: main max_num_tokens: 8192 quantization_type: no_quant runtime: webserver_default_route: /predict kv_cache_free_gpu_mem_fraction: 0.9 batch_scheduler_policy: guaranteed_no_evict ``` ## Validation and troubleshooting ### Common configuration errors **Error:** `encoder does not have a kv-cache, therefore a kv specfic datatype is not valid` * **Cause:** Using KV quantization (fp8\_kv, fp4\_kv) with encoder models * **Fix:** Use `fp8` or `no_quant` instead **Error:** `FP8 quantization is only supported on L4, H100, H200, B200` * **Cause:** Using `FP8` quantization on unsupported GPU. * **Fix:** Use H100 or newer GPU, or use `no_quant`. **Error:** `FP4 quantization is only supported on B200` * **Cause:** Using `FP4` quantization on unsupported GPU. * **Fix:** Use B200 GPU or `FP8` quantization. 
### Performance tuning **For maximum throughput:** * Use `max_num_tokens: 16384` for BEI. * Enable `FP8` quantization on supported hardware. * Use `batch_scheduler_policy: max_utilization` for high load. **For lowest latency:** * Use smaller `max_num_tokens` for your use case * Use `batch_scheduler_policy: guaranteed_no_evict` * Consider BEI-Bert for small models with cold-start optimization **For cost optimization:** * Use L4 GPUs with `FP8` quantization. * Use BEI-Bert for small models. * Tune `max_num_tokens` to your actual requirements. ## Migration from older configurations If you're migrating from older BEI configurations: 1. **Update base\_model**: Change from specific model types to `encoder` or `encoder_bert` 2. **Add checkpoint\_repository**: Use the new structured repository configuration 3. **Review quantization**: Ensure quantization type matches hardware capabilities 4. **Update engine**: Add engine configuration for better performance **Old configuration:** ```yaml theme={"system"} trt_llm: build: model_type: "bge" checkpoint_repo: "BAAI/bge-large-en-v1.5" ``` **New configuration:** ```yaml theme={"system"} trt_llm: build: base_model: encoder checkpoint_repository: source: HF repo: "BAAI/bge-large-en-v1.5" max_num_tokens: 16384 quantization_type: fp8 runtime: webserver_default_route: /v1/embeddings ``` --- # Source: https://docs.baseten.co/examples/bei.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Embeddings with BEI > Serve embedding, reranking, and classification models Baseten Embeddings Inference is Baseten's solution for production grade inference on embedding, classification and reranking models using TensorRT-LLM. With Baseten Embeddings Inference you get the following benefits: * Lowest-latency inference across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)1 * Highest-throughput inference across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.2 * High parallelism: up to 1400 client embeddings per second * Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime * Ahead-of-time compilation, memory allocation and fp8 post-training quantization ### Getting started with embedding models: Embedding models are LLMs without a lm\_head for language generation. Typical architectures that are supported for embeddings are `LlamaModel`, `BertModel`, `RobertaModel` or `Gemma2Model`, and contain the safetensors, config, tokenizer and sentence-transformer config files. A good example is the repo [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2). To deploy a model for embeddings, set the following config in your local directory. ```yaml config.yaml theme={"system"} model_name: BEI-Linq-Embed-Mistral resources: accelerator: H100_40GB use_gpu: true trt_llm: build: base_model: encoder checkpoint_repository: # for a different model, change the repo to e.g. to "Salesforce/SFR-Embedding-Mistral" # "BAAI/bge-en-icl" or "BAAI/bge-m3" repo: "Linq-AI-Research/Linq-Embed-Mistral" revision: main source: HF # only Llama, Mistral and Qwen Models support quantization. # others, use: "quantization_type: no_quant" quantization_type: fp8 ``` With `config.yaml` in your local directory, you can deploy the model to Baseten. 
```bash theme={"system"} truss push --publish --promote ``` Deployed embedding models are OpenAI compatible without any additional settings. You may use the client code below to consume the model. ```python theme={"system"} from openai import OpenAI import os client = OpenAI( api_key=os.environ['BASETEN_API_KEY'], # add the deployment URL base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1" ) embedding = client.embeddings.create( input=["Baseten Embeddings are fast.", "Embed this sentence!"], model="not-required" ) ``` ### Example deployment of reranking and classification models Besides embedding models, BEI deploys high-throughput rerank and classification models. You can identify suitable architectures by their `ForSequenceClassification` suffix in the Hugging Face repo. Typical use cases for these models are reward modeling, reranking documents in RAG, or tasks like content moderation. ```yaml theme={"system"} model_name: BEI-mixedbread-rerank-large-v2-fp8 resources: accelerator: H100_40GB cpu: '1' memory: 10Gi use_gpu: true trt_llm: build: base_model: encoder checkpoint_repository: repo: michaelfeil/mxbai-rerank-large-v2-seq revision: main source: HF # only Llama, Mistral and Qwen Models support quantization quantization_type: fp8 ``` As OpenAI does not offer reranking or classification, we are sending a simple request to the endpoint. Depending on the model, you might want to apply a specific prompt template first. ```python theme={"system"} import requests import os headers = { "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}" } # model specific prompt for mixedbread's reranker v2. prompt = ( "<|endoftext|><|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n<|im_end|>\n<|im_start|>user\n" "query: {query} \ndocument: {doc} \nYou are a search relevance expert who evaluates how well documents match search queries. For each query-document pair, carefully analyze the semantic relationship between them, then provide your binary relevance judgment (0 for not relevant, 1 for relevant).\nRelevance:<|im_end|>\n<|im_start|>assistant\n" ).format(query="What is Baseten?",doc="Baseten is a fast inference provider.") requests.post( headers=headers, url="https://model-xxxxxx.api.baseten.co/environments/production/sync/predict", json={ "inputs": prompt, "raw_scores": True, } ) ``` ### Benchmarks and Performance optimizations Embedding models on BEI are fast, and currently offer the fastest implementation for embeddings across all open-source and closed-source providers. The team behind the implementation are the authors of [infinity](https://github.com/michaelfeil/infinity). We recommend using fp8 quantization for Llama, Mistral and Qwen2 models on L4 or newer (L4, H100, H200 and B200). The quality difference between fp8 and bfloat16 is often negligible - embedding models typically retain >99% cosine similarity between the two precisions, and reranking models retain the ranking order - despite small differences in the raw outputs. For more details, check out the [technical launch post](https://www.baseten.co/blog/how-we-built-high-throughput-embedding-inference-with-tensorrt-llm/). The team at Baseten has additional options for sharing cached model weights across replicas - for very fast horizontal scaling. Please contact us to enable this option.
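If you want to quantify the fp8 vs. bfloat16 difference for your own model, one simple approach is to deploy both variants and compare their embeddings with cosine similarity. A minimal sketch, assuming two placeholder deployment URLs (one `fp8`, one `no_quant`):

```python theme={"system"}
import os

import numpy as np
from openai import OpenAI

def embed(base_url: str, texts: list[str]) -> np.ndarray:
    # Each BEI deployment exposes an OpenAI-compatible /v1/embeddings endpoint.
    client = OpenAI(api_key=os.environ["BASETEN_API_KEY"], base_url=base_url)
    response = client.embeddings.create(input=texts, model="not-required")
    return np.array([item.embedding for item in response.data])

texts = ["Baseten Embeddings are fast.", "Embed this sentence!"]
# Placeholder URLs: swap in your fp8 and no_quant (bfloat16) deployments.
emb_fp8 = embed("https://model-aaaaaa.api.baseten.co/environments/production/sync/v1", texts)
emb_bf16 = embed("https://model-bbbbbb.api.baseten.co/environments/production/sync/v1", texts)

# Row-wise cosine similarity between the two precisions; values above 0.99 are typical.
cosine = np.sum(emb_fp8 * emb_bf16, axis=1) / (
    np.linalg.norm(emb_fp8, axis=1) * np.linalg.norm(emb_bf16, axis=1)
)
print(cosine)
```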
### Deploy custom or fine-tuned models on BEI: We support the deployment of the models below, as well as all fine-tuned variants of these models (same architecture & customized weights). The following repositories are supported - this list is not exhaustive. | Model Repository | Architecture | Function | | --- | --- | --- | | [`Salesforce/SFR-Embedding-Mistral`](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) | MistralModel | embedding | | [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) | BertModel | embedding | | [`BAAI/bge-multilingual-gemma2`](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Gemma2Model | embedding | | [`mixedbread-ai/mxbai-embed-large-v1`](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | BertModel | embedding | | [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5) | BertModel | embedding | | [`allenai/Llama-3.1-Tulu-3-8B-RM`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | LlamaForSequenceClassification | classifier | | [`ncbi/MedCPT-Cross-Encoder`](https://huggingface.co/ncbi/MedCPT-Cross-Encoder) | BertForSequenceClassification | reranker/classifier | | [`SamLowe/roberta-base-go_emotions`](https://huggingface.co/SamLowe/roberta-base-go_emotions) | XLMRobertaForSequenceClassification | classifier | | [`mixedbread/mxbai-rerank-large-v2-seq`](https://huggingface.co/michaelfeil/mxbai-rerank-large-v2-seq) | Qwen2ForSequenceClassification | reranker/classifier | | [`BAAI/bge-en-icl`](https://huggingface.co/BAAI/bge-en-icl) | LlamaModel | embedding | | [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) | BertForSequenceClassification | reranker/classifier | | [`Skywork/Skywork-Reward-Llama-3.1-8B-v0.2`](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) | LlamaForSequenceClassification | classifier | | [`Snowflake/snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) | BertModel | embedding | | [`nomic-ai/nomic-embed-code`](https://huggingface.co/nomic-ai/nomic-embed-code) | Qwen2Model | embedding | 1 measured on H100-HBM3 (bert-large-335M, for BAAI/bge-en-icl: 9ms) 2 measured on H100-HBM3 (leading model architecture on MTEB, MistralModel-7B) --- # Source: https://docs.baseten.co/inference/output-format/binary.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Model I/O in binary > Decode and save binary model output Baseten and Truss natively support model I/O in binary and use msgpack encoding for efficiency. ## Deploy a basic Truss for binary I/O If you need a deployed model to try the invocation examples below, follow these steps to create and deploy a super basic Truss that accepts and returns binary data. The Truss performs no operations and is purely illustrative. To create a Truss, run: ```sh theme={"system"} truss init binary_test ``` This creates a Truss in a new directory `binary_test`. By default, newly created Trusses implement an identity function that returns the exact input they are given.
Optionally, modify `binary_test/model/model.py` to log that the data received is of type `bytes`: ```python binary_test/model/model.py theme={"system"} def predict(self, model_input): # Run model inference here print(f"Input type: {type(model_input['byte_data'])}") return model_input ``` Deploy the Truss to Baseten for development: ```sh theme={"system"} truss push --watch ``` Or for production: ```sh theme={"system"} truss push --publish ``` ## Send raw bytes as model input To send binary data as model input: 1. Set the `content-type` HTTP header to `application/octet-stream` 2. Use `msgpack` to encode the data or file 3. Make a POST request to the model This code sample assumes you have a file `Gettysburg.mp3` in the current working directory. You can download the [11-second file from our CDN](https://cdn.baseten.co/docs/production/Gettysburg.mp3) or replace it with your own file. ```python call_model.py theme={"system"} import os import requests import msgpack model_id = "MODEL_ID" # Replace this with your model ID deployment = "development" # `development`, `production`, or a deployment ID baseten_api_key = os.environ["BASETEN_API_KEY"] # Specify the URL to which you want to send the POST request url = f"https://model-{model_id}.api.baseten.co/{deployment}/predict" headers={ "Authorization": f"Api-Key {baseten_api_key}", "content-type": "application/octet-stream", } with open('Gettysburg.mp3', 'rb') as file: response = requests.post( url, headers=headers, data=msgpack.packb({'byte_data': file.read()}) ) print(response.status_code) print(response.headers) ``` To support certain types like numpy and datetime values, you may need to extend client-side `msgpack` encoding with the same [encoder and decoder used by Truss](https://github.com/basetenlabs/truss/blob/main/truss/templates/shared/serialization.py). ## Parse raw bytes from model output To use the output of a non-streaming model response, decode the response content. ```python call_model.py theme={"system"} # Continues `call_model.py` from above binary_output = msgpack.unpackb(response.content) # Change extension if not working with mp3 data with open('output.mp3', 'wb') as file: file.write(binary_output["byte_data"]) ``` ## Streaming binary outputs You can also stream output as binary. This is useful for sending large files or reading binary output as it is generated. In `model.py`, you must return a streaming output. ```python model/model.py theme={"system"} # Replace the predict function in your Truss def predict(self, model_input): import os current_dir = os.path.dirname(__file__) file_path = os.path.join(current_dir, "tmpfile.txt") with open(file_path, mode="wb") as file: file.write(bytes(model_input["text"], encoding="utf-8")) def iterfile(): # Get the directory of the current file current_dir = os.path.dirname(__file__) # Construct the full path to the temporary file file_path = os.path.join(current_dir, "tmpfile.txt") with open(file_path, mode="rb") as file_like: yield from file_like return iterfile() ``` Then, in your client, you can use streaming output directly without decoding.
```python stream_model.py theme={"system"} import os import requests import json model_id = "MODEL_ID" # Replace this with your model ID deployment = "development" # `development`, `production`, or a deployment ID baseten_api_key = os.environ["BASETEN_API_KEY"] # Specify the URL to which you want to send the POST request url = f"https://model-{model_id}.api.baseten.co/{deployment}/predict" headers={ "Authorization": f"Api-Key {baseten_api_key}", } s = requests.Session() with s.post( # Endpoint for production deployment, see API reference for more f"https://model-{model_id}.api.baseten.co/{deployment}/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, data=json.dumps({"text": "Lorem Ipsum"}), # Include stream=True as an argument so the requests library knows to stream stream=True, ) as response: for token in response.iter_content(1): print(token) # Prints bytes ``` --- # Source: https://docs.baseten.co/development/chain/binaryio.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Binary IO > Performant serialization of numeric data Numeric data or audio/video are most efficiently transmitted as bytes. Other representations such as JSON or base64 encoding lose precision, add significant parsing overhead and increase message sizes (e.g. \~33% increase for base64 encoding). Chains extends the JSON-centred pydantic ecosystem with two ways to include binary data: numpy array support and raw bytes. ## Numpy `ndarray` support Once you have your data represented as a numpy array, you can easily (and often without copying) convert it to `torch`, `tensorflow` or other common numeric libraries' objects. To include numpy arrays in a pydantic model, chains has a special field type implementation `NumpyArrayField`. For example: ```python theme={"system"} import numpy as np import pydantic from truss_chains import pydantic_numpy class DataModel(pydantic.BaseModel): some_numbers: pydantic_numpy.NumpyArrayField other_field: str ... numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") print(data) # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [0.39595027 0.23837526] # [0.56714894 0.61244946] # [0.45821942 0.42464844]]) # other_field='Example' ``` `NumpyArrayField` is a wrapper around the actual numpy array. Inside your Python code, you can work with its `array` attribute: ```python theme={"system"} data.some_numbers.array += 10 # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [10.39595027 10.23837526] # [10.56714894 10.61244946] # [10.45821942 10.42464844]]) # other_field='Example' ``` The interesting part is how it serializes when communicating between Chainlets or with a client. It can work in two modes: JSON and binary. ### Binary As a JSON alternative that supports byte data, Chains uses `msgpack` (with `msgpack_numpy`) to serialize the dict representation.
For Chainlet-Chainlet RPCs this is done automatically for you by enabling binary mode of the dependency Chainlets; see [all options](/reference/sdk/chains#truss-chains-depends): ```python theme={"system"} import truss_chains as chains class Worker(chains.ChainletBase): async def run_remote(self, data: DataModel) -> DataModel: data.some_numbers.array += 10 return data class Consumer(chains.ChainletBase): def __init__(self, worker=chains.depends(Worker, use_binary=True)): self._worker = worker async def run_remote(self): numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") result = await self._worker.run_remote(data) ``` Now the data is transmitted in a fast and compact way between Chainlets, which often improves performance. ### Binary client If you want to send such data as input to a chain or parse binary output from a chain, you have to add the `msgpack` serialization client-side: ```python theme={"system"} import requests import msgpack import msgpack_numpy msgpack_numpy.patch() # Register hook for numpy. # Dump to "python" dict and then to binary. data_dict = data.model_dump(mode="python") data_bytes = msgpack.dumps(data_dict) # Set binary content type in request header. headers = { "Content-Type": "application/octet-stream", "Authorization": ... } response = requests.post(url, data=data_bytes, headers=headers) response_dict = msgpack.loads(response.content) response_model = ResponseModel.model_validate(response_dict) ``` The steps of dumping from a pydantic model and validating the response dict into a pydantic model can be skipped if you prefer working with raw dicts on the client. The implementation of `NumpyArrayField` only needs `pydantic`, no other Chains dependencies. So you can take that implementation code in isolation and integrate it in your client code. Some version combinations of `msgpack` and `msgpack_numpy` give errors; we know that `msgpack = ">=1.0.2"` and `msgpack-numpy = ">=0.4.8"` work. ### JSON The JSON-schema to represent the array is a dict of `shape (tuple[int]), dtype (str), data_b64 (str)`. E.g. ```python theme={"system"} print(data.model_dump_json()) '{"some_numbers":{"shape":[3,2],"dtype":"float64", "data_b64":"30d4/rnKJEAsvm...' ``` The base64 data corresponds to `np.ndarray.tobytes()`. To get back to the array from the JSON string, use the model's `model_validate_json` method. As discussed in the beginning, this schema is not performant for numeric data and only offered as a compatibility layer (JSON does not allow bytes) - generally prefer the binary format. ## Simple `bytes` fields It is possible to add a `bytes` field to a pydantic model used in a chain, or as a plain argument to `run_remote`. This can be useful to include non-numpy data formats such as images or audio/video snippets. In this case, the "normal" JSON representation does not work and all involved requests or Chainlet-Chainlet-invocations must use binary mode. The same steps as for arrays [above](#binary-client) apply: construct dicts with `bytes` values and keys corresponding to the `run_remote` argument names or the field names in the pydantic model. Then use `msgpack` to serialize and deserialize those dicts. Don't forget to add the `Content-Type` header, and note that `response.json()` will not work. --- # Source: https://docs.baseten.co/engines/bis-llm/bis-llm-config.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further.
# Reference Config (BIS-LLM) > Complete reference config for V2 inference stack and MoE models This reference provides complete configuration options for the BIS-LLM (Baseten Inference Stack V2) engine. BIS-LLM uses the V2 inference stack with simplified configuration and enhanced features for MoE models and advanced use cases. ## Configuration structure ```yaml theme={"system"} trt_llm: inference_stack: v2 # Always v2 for BIS-LLM build: checkpoint_repository: {...} quantization_type: no_quant | fp8 | fp4 quantization_config: {...} num_builder_gpus: 1 skip_build_result: false runtime: max_seq_len: 32768 max_batch_size: 256 max_num_tokens: 8192 tensor_parallel_size: 1 enable_chunked_prefill: true served_model_name: "model-name" patch_kwargs: {...} ``` ## Build configuration ### `checkpoint_repository` Specifies where to find the model checkpoint. Same structure as V1 but with V2-specific optimizations. **Structure:** ```yaml theme={"system"} checkpoint_repository: source: HF | GCS | S3 | AZURE | REMOTE_URL | BASETEN_TRAINING repo: "model-repository-name" revision: main # Optional, only for HF runtime_secret_name: hf_access_token # Optional, for private repos ``` For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3). ### `quantization_type` Quantization options for the V2 inference stack (simplified from V1): **Options:** * `no_quant`: uses the precision of the checkpoint repo (fp16 / bf16). Unique to BIS-LLM: pre-quantized checkpoints produced with the nvidia-modelopt libraries are also supported. * `fp8`: FP8 weights + 16-bit KV cache * `fp4`: FP4 weights + 16-bit KV cache (B200 only) * `fp4_mlp_only`: FP4 MLP layers only + 16-bit KV cache For detailed quantization guidance including hardware requirements, calibration strategies, and model-specific recommendations, see [Quantization Guide](/engines/performance-concepts/quantization-guide). ### `quantization_config` Configuration for post-training quantization calibration: **Structure:** ```yaml theme={"system"} quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 ``` ### `num_builder_gpus` Number of GPUs to use during the build process. **Default:** `1` (auto-detected from resources)\ **Range:** 1 to 8 **Example:** ```yaml theme={"system"} build: num_builder_gpus: 4 # For large models or complex quantization ``` ### `skip_build_result` Skip the engine build step and use a pre-built model that does not require any quantization. **Default:** `false`\ **Use case:** When you have a pre-built engine from model cache **Example:** ```yaml theme={"system"} build: skip_build_result: true ``` ## Engine configuration ### `max_seq_len` Maximum sequence length (context) for single requests. **Default:** `32768` (32K)\ **Range:** 1 to 1048576 **Example:** ```yaml theme={"system"} runtime: max_seq_len: 131072 # 128K context ``` ### `max_batch_size` Maximum number of input sequences processed concurrently. **Default:** `256`\ **Range:** 1 to 2048 **Example:** ```yaml theme={"system"} runtime: max_batch_size: 128 # Lower for better latency ``` ### `max_num_tokens` Maximum number of batched input tokens after padding removal. **Default:** `8192`\ **Range:** 64 to 131072 **Example:** ```yaml theme={"system"} runtime: max_num_tokens: 16384 # Higher for better throughput ``` ### `tensor_parallel_size` Number of GPUs to use for tensor parallelism.
**Default:** `1` (auto-detected from resources)\ **Range:** 1 to 8 **Example:** ```yaml theme={"system"} runtime: tensor_parallel_size: 4 # For large models ``` ### `enable_chunked_prefill` Enable chunked prefilling for long sequences. **Default:** `true` **Example:** ```yaml theme={"system"} runtime: enable_chunked_prefill: true ``` ### `served_model_name` Model name returned in API responses. **Default:** `None` (uses model name from config) **Example:** ```yaml theme={"system"} runtime: served_model_name: "gpt-oss-120b" ``` ### `patch_kwargs` Advanced configuration patches for V2 inference stack. **Structure:** ```yaml theme={"system"} patch_kwargs: custom_setting: "value" advanced_config: nested_setting: true ``` **Note:** This is a preview feature and may change in future versions. ## Complete configuration examples ### Qwen3-30B-A3B-Instruct-2507 MoE with FP4 on B200 ```yaml theme={"system"} model_name: Qwen3-30B-A3B-Instruct-2507-FP4 resources: accelerator: B200:1 cpu: '4' memory: 40Gi use_gpu: true trt_llm: inference_stack: v2 build: checkpoint_repository: source: HF repo: "Qwen/Qwen3-Coder-30B-A3B-Instruct" revision: main quantization_type: fp4 quantization_config: calib_size: 2048 calib_dataset: "cnn_dailymail" calib_max_seq_length: 4096 num_builder_gpus: 1 runtime: max_seq_len: 65536 max_batch_size: 256 max_num_tokens: 8192 tensor_parallel_size: 1 enable_chunked_prefill: true served_model_name: "Qwen3-30B-A3B-Instruct-2507" ``` ### GPT-OSS 120B on B200:1 with no\_quant **Note**: GPT-OSS can be optimized much further. The example below is functional, but you can squeeze much more performance out of a `B200`, e.g. with Baseten's custom Eagle Heads. ```yaml theme={"system"} model_name: gpt-oss-120b-b200 resources: accelerator: B200:1 cpu: '4' memory: 40Gi use_gpu: true trt_llm: inference_stack: v2 build: checkpoint_repository: source: HF repo: "openai/gpt-oss-120b" revision: main runtime_secret_name: hf_access_token quantization_type: no_quant quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 runtime: max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 16384 tensor_parallel_size: 1 enable_chunked_prefill: true served_model_name: "gpt-oss-120b" ``` ### DeepSeek V3 **Note**: DeepSeek V3 / V3.1 / V3.2 can be optimized much further. The example below is functional, but you can squeeze much more performance out of `B200:4`, e.g. with MTP Heads and disaggregated serving, or data-parallel attention.
```yaml theme={"system"} model_name: nvidia/DeepSeek-V3.1-NVFP4 resources: accelerator: B200:4 cpu: '8' memory: 80Gi use_gpu: true trt_llm: inference_stack: v2 build: checkpoint_repository: source: HF repo: "nvidia/DeepSeek-V3.1-NVFP4" revision: main runtime_secret_name: hf_access_token quantization_type: no_quant # nvidia/DeepSeek-V3.1-NVFP4 is already modelopt compatible quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 runtime: max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 16384 tensor_parallel_size: 8 enable_chunked_prefill: true served_model_name: "nvidia/DeepSeek-V3.1-NVFP4" ``` ## V2 vs V1 configuration differences ### Simplified build configuration **V1 build configuration:** ```yaml theme={"system"} trt_llm: build: base_model: decoder max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 8192 quantization_type: fp8_kv tensor_parallel_count: 4 plugin_configuration: {...} speculator: {...} ``` **V2 build configuration:** ```yaml theme={"system"} trt_llm: inference_stack: v2 build: checkpoint_repository: {...} quantization_type: fp8 num_builder_gpus: 4 runtime: max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 8192 tensor_parallel_size: 4 ``` ### Key differences 1. **`inference_stack`**: Explicitly set to `v2` 2. **Simplified build options**: Many V1 options moved to engine 3. **No `base_model`**: Automatically detected from checkpoint 4. **No `plugin_configuration`**: Handled automatically 5. **No `speculator`**: Lookahead decoding requires FDE involvement. 6. **Tensor parallel**: Moved to engine as `tensor_parallel_size` ## Validation and troubleshooting ### Common V2 configuration errors **Error:** `Field trt_llm.build.base_model is not allowed to be set when using v2 inference stack` * **Cause:** Setting `base_model` in V2 configuration * **Fix:** Remove `base_model` field, V2 detects automatically **Error:** `Field trt_llm.build.quantization_type is not allowed to be set when using v2 inference stack` * **Cause:** Using unsupported quantization type * **Fix:** Use supported quantization: `no_quant`, `fp8`, `fp4`, `fp4_mlp_only`, `fp4_kv`, `fp8_kv` **Error:** `Field trt_llm.build.speculator is not allowed to be set when using v2 inference stack` * **Cause:** Trying to use lookahead decoding in V2 * **Fix:** Use the V1 stack for lookahead decoding, use V2 without speculation, or reach out to us to enable V2 with speculation. ## Migration from V1 ### V1 to V2 migration **V1 configuration:** ```yaml theme={"system"} trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: "Qwen/Qwen3-4B" max_seq_len: 32768 max_batch_size: 256 max_num_tokens: 8192 quantization_type: fp8_kv tensor_parallel_count: 1 plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: true runtime: kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true ``` **V2 configuration:** ```yaml theme={"system"} trt_llm: inference_stack: v2 build: checkpoint_repository: source: HF repo: "Qwen/Qwen3-4B" quantization_type: fp8_kv runtime: max_seq_len: 32768 max_batch_size: 256 max_num_tokens: 8192 tensor_parallel_size: 1 enable_chunked_prefill: true ``` ### Migration steps 1. **Add `inference_stack: v2`** 2. **Remove `base_model`** (auto-detected) 3. **Move `max_seq_len`, `max_batch_size`, `max_num_tokens` to engine** 4. **Change `tensor_parallel_count` to `tensor_parallel_size`** 5. **Remove `plugin_configuration`** (handled automatically) 6. **Update quantization type** (V2 has simplified options) 7.
**Remove `speculator`** (not supported in V2) ## Hardware selection **GPU recommendations for V2:** * **B200**: Best for FP4 quantization and next-gen performance * **H100**: Best for FP8 quantization and production workloads * **Multi-GPU**: Required for large MoE models (>30B parameters) **Configuration guidelines:** | **Model Size** | **Recommended GPU** | **Quantization** | **Tensor Parallel** | | -------------- | ------------------- | ---------------- | ------------------- | | `<30B` MoE | H100:2-4 | FP8 | 2-4 | | 30-100B MoE | H100:4-8 | FP8 | 4-8 | | 100B+ MoE | B200:4-8 | FP4 | 4-8 | | Dense >30B | H100:2-4 | FP8 | 2-4 | ## Further reading * [BIS-LLM overview](/engines/bis-llm/overview) - Main engine documentation * [Advanced features documentation](/engines/bis-llm/advanced-features) - Enterprise features and capabilities * [Structured outputs for BIS-LLM](/engines/performance-concepts/structured-outputs) - Advanced JSON schema validation * [Examples section](/examples/overview) - Concrete deployment examples --- # Source: https://docs.baseten.co/development/model/build-commands.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Custom build commands > How to run your own docker commands during the build stage The `build_commands` feature allows you to **run custom Docker commands** during the **build stage**, enabling **advanced caching**, **dependency management**, **and environment setup**. **Use Cases:** * Clone GitHub repositories * Install dependencies * Create directories * Pre-download model weights ## 1. Using Build Commands in `config.yaml` Add `build_commands` to your `config.yaml`: ```yaml theme={"system"} build_commands: - git clone https://github.com/comfyanonymous/ComfyUI.git - cd ComfyUI && git checkout b1fd26fe9e55163f780bf9e5f56bf9bf5f035c93 && pip install -r requirements.txt model_name: Build Commands Demo python_version: py310 resources: accelerator: A100 use_gpu: true ``` **What happens?** * The GitHub repository is cloned. * The specified commit is checked out. * Dependencies are installed. * **Everything is cached at build time**, reducing deployment cold starts. ## 2. Creating Directories in Your Truss Use `build_commands` to **create directories** directly in the container. ```yaml theme={"system"} build_commands: - git clone https://github.com/comfyanonymous/ComfyUI.git - cd ComfyUI && mkdir ipadapter - cd ComfyUI && mkdir instantid ``` Useful for **large codebases** requiring additional structure. ## 3. Caching Model Weights Efficiently For large weights (10GB+), use `model_cache` or `external_data`. For smaller weights, **use** `wget` in `build_commands`: ```yaml theme={"system"} build_commands: - git clone https://github.com/comfyanonymous/ComfyUI.git - cd ComfyUI && pip install -r requirements.txt - cd ComfyUI/models/controlnet && wget -O control-lora-canny-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-canny-rank256.safetensors - cd ComfyUI/models/controlnet && wget -O control-lora-depth-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-depth-rank256.safetensors model_name: Build Commands Demo python_version: py310 resources: accelerator: A100 use_gpu: true system_packages: - wget ``` **Why use this?** * **Reduces startup time** by **preloading model weights** during the build stage. 
* **Ensures availability** without runtime downloads. ## 4. Running Any Shell Command The `build_commands` feature lets you execute **any** shell command as if running it locally, with the benefit of **caching the results** at build time. **Key Benefits:** * **Reduces cold starts** by caching dependencies & data. * **Ensures reproducibility** across deployments. * **Optimizes environment setup** for fast execution. --- # Source: https://docs.baseten.co/development/model/build-your-first-model.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Your first model > Build and deploy your first model This quickstart guide shows you how to build and deploy your first model using Baseten's Truss framework. ## Prerequisites To use Truss, install a recent Truss version and ensure pydantic is v2: ```bash theme={"system"} pip install --upgrade truss 'pydantic>=2.0.0' ``` Truss requires Python `>=3.9,<3.15`. To set up a fresh development environment, you can use the following commands, creating an environment named `truss_env` using `pyenv`: ```bash theme={"system"} curl https://pyenv.run | bash echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc echo 'eval "$(pyenv init -)"' >> ~/.bashrc source ~/.bashrc pyenv install 3.11.0 ENV_NAME="truss_env" pyenv virtualenv 3.11.0 $ENV_NAME pyenv activate $ENV_NAME pip install --upgrade truss 'pydantic>=2.0.0' ``` To deploy Truss remotely, you also need a [Baseten account](https://app.baseten.co/signup). It is handy to export your API key in the current shell session or set it permanently in your `.bashrc`: ```bash ~/.bashrc theme={"system"} export BASETEN_API_KEY="nPh8..." ``` ## Initialize your model Truss is a tool that helps you package your model code and configuration, and ship it to Baseten for deployment, testing, and scaling. To create your first model, you can use the `truss init` command. ```bash theme={"system"} $ truss init hello-world ? 📦 Name this model: HelloWorld Truss HelloWorld was created in ~/hello-world ``` This will create a new directory called `hello-world` with the following files: * `config.yaml` - A configuration file for your model. * `model/model.py` - A Python file that contains your model code * `packages/` - A folder to hold any dependencies your model needs * `data/` - A folder to hold any data your model needs For this example, we'll focus on the `config.yaml` file and the `model.py` file. ### `config.yaml` The `config.yaml` file is used to configure dependencies, resources, and other settings for your model.
Let's take a look at the contents: ```yaml config.yaml theme={"system"} build_commands: [] environment_variables: {} external_package_dirs: [] model_metadata: {} model_name: HelloWorld python_version: py311 requirements: [] resources: accelerator: null cpu: '1' memory: 2Gi use_gpu: false secrets: {} system_packages: [] ``` Some key fields to note: * `requirements`: This is a list of `pip` packages that will be installed when your model is deployed. * `resources`: This is where you can specify the resources your model will use. * `secrets`: This is where you can specify any secrets your model will need, such as HuggingFace API keys. See the [Configuration](/development/model/configuration) page for more information on the `config.yaml` file. ### `model.py` Next, let's take a look at the `model.py` file. ```python theme={"system"} class Model: def __init__(self, **kwargs): pass def load(self): pass def predict(self, model_input): return model_input ``` In Truss models, we expect users to provide a Python class with the following methods: * `__init__`: This is the constructor. * `load`: This is called at model startup, and should include any setup logic, such as weight downloading or initialization * `predict`: This is the method that is called during inference. ## Deploy your model To deploy your model for development with live reload, run: ```bash theme={"system"} $ truss push --watch ``` This will deploy your model to Baseten as a development deployment with live reload enabled. When no flag is specified, `truss push` defaults to a published deployment. Use `--watch` for development deployments with live reload support, or `--publish` explicitly for production-ready deployments. ## Invoke your model After deploying your model, you can invoke it with the invocation URL provided: ```bash theme={"system"} $ curl -X POST https://model-{model-id}.api.baseten.co/development/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '"some text"' "some text" ``` ## A Real Example To show a slightly more complex example, let's deploy a text classification model from HuggingFace! In this example, we'll use the `transformers` library to load a pre-trained model, from HuggingFace, and use it to classify the given text. ### `config.yaml` To deploy this model, we need to add a few more dependencies to our `config.yaml` file. ```yaml config.yaml theme={"system"} requirements: - transformers - torch ``` ### `model.py` Next, let's change our `model.py` file to use the `transformers` library to load the model, and then use it to predict the sentiment of a given text. ```python model.py theme={"system"} from transformers import pipeline class Model: def __init__(self, **kwargs): pass def load(self): self._model = pipeline("text-classification") def predict(self, model_input): return self._model(model_input) ``` ## Running inference Similarly to our previous example, we can deploy this model using `truss push --watch` ```bash theme={"system"} $ truss push --watch ``` And then invoke it using the invocation URL on Baseten. ```bash theme={"system"} $ curl -X POST https://model-{model-id}.api.baseten.co/development/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"text": "some text"}' ``` ## Next steps Now that you've deployed your first model, you can learn more about more options for [configuring your model](/development/model/configuration), and [implementing your model](/development/model/implementation). 
--- # Source: https://docs.baseten.co/training/concepts/cache.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Cache > Learn how to use the training cache to speed up your training iterations by persisting data between jobs. The training cache enables you to persist data between training jobs. This can significantly improve iteration speed by skipping expensive downloads and data transformations. ## How to Use the Training Cache Set the cache configuration in your `Runtime`: ```python theme={"system"} from truss_train import definitions training_runtime = definitions.Runtime( # ... other configuration options cache_config=definitions.CacheConfig(enabled=True) ) ``` ## Cache Directory By default, the cache will be mounted in two locations * `/root/.cache/user_artifacts`, which can be accessed via the [`$BT_PROJECT_CACHE_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable. This cache is shared by all jobs in a project. * `/root/.cache/team_artifacts`, which can be accessed via the [`$BT_TEAM_CACHE_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable. This cache is shared by all jobs for a team. ## Hugging Face Cache Mount You can mount your cache to the Hugging Face cache directory by setting `HF_HOME` to one of the provided mount points plus `/huggingface`. For example, you can set `HF_HOME=$BT_PROJECT_CACHE_DIR/huggingface` to use the project cache directory. However, there are considerable technical pitfalls when trying to read from the cache with multiple processes, as Huggingface doesn't work well with distributed filesystems. To help enable this use case, ensure your dataset processors or process count is set to 1 to minimize the number of concurrent readers. ## Seeding Your Data and Models For multi-gpu training, you should ensure that your data is seeded before running multi-process training jobs. You can do this by separating out a data loading script and a training script. For a 400 GB HF Dataset, you can expect to save *nearly an hour* of compute time for each job - data download and preparation have been done already! ## Cache Management You can inspect the contents of the cache through CLI with `truss train cache summarize `. This visibility into what's in the cache can help you verify your code is working as expected, and additionally manage files and artifacts you no longer need. When you delete a project, all data in the project's training cache (`$BT_PROJECT_CACHE_DIR`) is permanently deleted with no archival or recovery option. See [Management](/training/management) for details. --- # Source: https://docs.baseten.co/inference/calling-your-model.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Call your model > Run inference on deployed models Once deployed, your model is accessible through an [API endpoint](/reference/inference-api/overview). To make an inference request, you'll need: * **Model ID**: Found in the Baseten dashboard or returned when you deploy. * **[API key](/organization/api-keys)**: Authenticates your requests. * **JSON-serializable model input**: The data your model expects. 
## Authentication Include your API key in the `Authorization` header: ```sh theme={"system"} curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -H "Content-Type: application/json" \ -d '{"prompt": "Hello, world!"}' ``` In Python with requests: ```python theme={"system"} import requests import os api_key = os.environ["BASETEN_API_KEY"] model_id = "YOUR_MODEL_ID" response = requests.post( f"https://model-{model_id}.api.baseten.co/environments/production/predict", headers={"Authorization": f"Api-Key {api_key}"}, json={"prompt": "Hello, world!"}, ) print(response.json()) ``` ## Predict API endpoints Baseten provides multiple endpoints for different inference modes: * [`/predict`](/reference/inference-api/overview#predict-endpoints) – Standard synchronous inference. * [`/async_predict`](/reference/inference-api/overview#predict-endpoints) – Asynchronous inference for long-running tasks. Endpoints are available for environments and all deployments. See the [API reference](/reference/inference-api/overview) for details. ## Sync API endpoints Custom servers support both `predict` endpoints as well as a special `sync` endpoint. By using the `sync` endpoint you are able to call different routes in your custom server. ``` https://model-{model-id}.api.baseten.co/environments/{production}/sync/{route} ``` Here are a few examples that show how the sync endpoint maps to the custom server's routes. * `https://model-{model_id}.../sync/health` -> `/health` * `https://model-{model_id}.../sync/items` -> `/items` * `https://model-{model_id}.../sync/items/123` -> `/items/123` ## OpenAI SDK When deploying a model with Engine-Builder, you will get an OpenAI compatible server. If you are already using one of the OpenAI SDKs, you will simply need to update the base url to your Baseten model URL and include your Baseten API Key. ```python theme={"system"} import os from openai import OpenAI model_id = "abcdef" # TODO: replace with your model id api_key = os.environ.get("BASETEN_API_KEY") model_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1" client = OpenAI( base_url=model_url, api_key=api_key, ) stream = client.chat.completions.create( model="baseten", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"} ], stream=True, ) for chunk in stream: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="") ``` ## Alternative invocation methods * **Truss CLI**: [`truss predict`](/reference/cli/truss/predict) * **Model Dashboard**: "Playground" button in the Baseten UI --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/cancel-async-request.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async cancel request > Use this endpoint to cancel a queued async request. Only `QUEUED` requests may be canceled. ### Parameters The ID of the model. The ID of the chain. The ID of the async request. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Response The ID of the async request. Whether the request was canceled. Additional details about whether the request was canceled. 
### Rate limits Calls to the cancel async request status endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code. ```python Python (Model) theme={"system"} import requests import os model_id = "" request_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.delete( f"https://model-{model_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` ```python Python (Chain) theme={"system"} import requests import os chain_id = "" request_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.delete( f"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` --- # Source: https://docs.baseten.co/reference/management-api/deployments/promote/cancel-promotion.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Cancel model promotion > Cancels an ongoing promotion to an environment and returns the cancellation status. ```json 200 theme={"system"} { "status": "CANCELED", "message": "Promotion to production was successfully canceled." } ``` ```json 400 theme={"system"} { "code": "VALIDATION_ERROR", "message": "Environment production has no in progress promotion." } ``` ## OpenAPI ````yaml post /v1/models/{model_id}/environments/{env_name}/cancel_promotion openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments/{env_name}/cancel_promotion: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/env_name' post: summary: Cancels a promotion to an environment description: >- Cancels an ongoing promotion to an environment and returns the cancellation status. responses: '200': description: The response to a request to cancel a promotion. content: application/json: schema: $ref: '#/components/schemas/CancelPromotionResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true env_name: schema: type: string name: env_name in: path required: true schemas: CancelPromotionResponseV1: description: The response to a request to cancel a promotion. properties: status: $ref: '#/components/schemas/CancelPromotionStatusV1' description: >- Status of the request to cancel a promotion. Can be CANCELED or RAMPING_DOWN. message: description: A message describing the status of the request to cancel a promotion title: Message type: string required: - status - message title: CancelPromotionResponseV1 type: object CancelPromotionStatusV1: description: The status of a request to cancel a promotion. enum: - CANCELED - RAMPING_DOWN title: CancelPromotionStatusV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/examples/chains-audio-transcription.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Transcribe audio with Chains > Process hours of audio in seconds using efficient chunking, distributed inference, and optimized GPU resources. This guide walks through building an audio transcription pipeline using Chains. You'll break down large media files, distribute transcription tasks across autoscaling deployments, and leverage high-performance GPUs for rapid inference. # 1. Overview This Chain enables fast, high-quality transcription by: * **Partitioning** long files (10+ hours) into smaller segments. * **Detecting silence** to optimize split points. * **Parallelizing inference** across multiple GPU-backed deployments. * **Batching requests** to maximize throughput. * **Using range downloads** for efficient data streaming. * Leveraging `asyncio` for concurrent execution. # 2. Chain Structure Transcription is divided into two processing layers: 1. **Macro chunks:** Large segments (\~300s) split from the source media file. These are processed in parallel to handle massive files efficiently. 2. **Micro chunks:** Smaller segments (\~5–30s) extracted from macro chunks and sent to the Whisper model for transcription. # 3. Implementing the Chainlets ## `Transcribe` (Entrypoint Chainlet) Handles transcription requests and dispatches tasks to worker Chainlets. Function signature: ```python theme={"system"} async def run_remote( self, media_url: str, params: data_types.TranscribeParams ) -> data_types.TranscribeOutput: ``` **Steps:** * Validates that the media source supports **range downloads**. * Uses **FFmpeg** to extract metadata and duration. * Splits the file into **macro chunks**, optimizing split points at silent sections. * Dispatches **macro chunk tasks** to the MacroChunkWorker for processing. * Collects **micro chunk transcriptions**, merges results, and returns the final text. **Example request:** ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '' ``` ```json theme={"system"} { "media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4", "params": { "micro_chunk_size_sec": 30, "macro_chunk_size_sec": 300 } } ``` ## `MacroChunkWorker` (Processing Chainlet) Processes **macro chunks** by: * **Extracting** relevant time segments using **FFmpeg**. * **Streaming audio** instead of downloading full files for low latency. * **Splitting segments** at silent points. * **Encoding** audio in base64 for efficient transfer. * **Distributing micro chunks** to the Whisper model for transcription. This Chainlet **runs in parallel** with multiple instances autoscaled dynamically. ## `WhisperModel` (Inference Model) A separately deployed **Whisper** model Chainlet handles speech-to-text transcription. * Deployed **independently** to allow fast iteration on business logic without redeploying the model. * Used **across different Chains** or accessed directly as a standalone model. * Supports **multiple environments** (e.g., dev, prod) using the same instance. Whisper can also be deployed as a **standard Truss model**, separate from the Chain. # 4. Optimizing Performance Even for very large files, **processing time remains bounded** by parallel execution. 
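To see why processing time stays bounded, it helps to picture the fan-out: the entrypoint dispatches every macro chunk concurrently and then waits only for the slowest one, so wall-clock time tracks the largest chunk rather than the total audio duration. The sketch below illustrates this pattern with `asyncio`; the `worker.run_remote(...)` signature, the `macro_chunks` objects, and the `.text` field are illustrative assumptions, not the exact names used in the example repo.

```python theme={"system"}
import asyncio

async def transcribe_all_macro_chunks(worker, macro_chunks, params):
    # Fan out: one task per macro chunk, all running concurrently against
    # autoscaled MacroChunkWorker replicas (names here are hypothetical).
    tasks = [asyncio.create_task(worker.run_remote(chunk, params)) for chunk in macro_chunks]
    # Fan in: the await finishes when the slowest chunk finishes, not after the
    # sum of all chunk durations.
    results = await asyncio.gather(*tasks)
    # Merge per-chunk transcriptions back into document order.
    return " ".join(result.text for result in results)
```

Because macro chunks are independent, adding replicas keeps reducing end-to-end latency until you hit the floor set by a single chunk's download and inference time.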
## Key performance tuning parameters: * `micro_chunk_size_sec` → Balance GPU utilization and inference latency. * `macro_chunk_size_sec` → Adjust chunk size for optimal parallelism. * **Autoscaling settings** → Tune concurrency and replica counts for load balancing. Example speedup: ```json theme={"system"} { "input_duration_sec": 734.26, "processing_duration_sec": 82.42, "speedup": 8.9 } ``` # 5. Deploy and run the Chain ## Deploy WhisperModel first: ```bash theme={"system"} truss chains push whisper_chainlet.py ``` Copy the **invocation URL** and update `WHISPER_URL` in `transcribe.py`. ## Deploy the transcription Chain: ```bash theme={"system"} truss chains push transcribe.py ``` ## Run transcription on a sample file: ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '' ``` *** # Next Steps * Learn more about [Chains](/development/chain/overview). * Optimize GPU **autoscaling** for peak efficiency. * Extend the pipeline with **custom business logic**. --- # Source: https://docs.baseten.co/examples/chains-build-rag.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # RAG pipeline with Chains > Build a RAG (retrieval-augmented generation) pipeline with Chains [Learn more about Chains](/development/chain/overview) ## Prerequisites To use Chains, install a recent Truss version and ensure pydantic is v2: ```bash theme={"system"} pip install --upgrade truss 'pydantic>=2.0.0' ``` Truss requires python `>=3.9,<3.15`. To set up a fresh development environment, you can use the following commands, creating a environment named `chains_env` using `pyenv`: ```bash theme={"system"} curl https://pyenv.run | bash echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc echo 'eval "$(pyenv init -)"' >> ~/.bashrc source ~/.bashrc pyenv install 3.11.0 ENV_NAME="chains_env" pyenv virtualenv 3.11.0 $ENV_NAME pyenv activate $ENV_NAME pip install --upgrade truss 'pydantic>=2.0.0' ``` To deploy Chains remotely, you also need a [Baseten account](https://app.baseten.co/signup). It is handy to export your API key to the current shell session or permanently in your `.bashrc`: ```bash ~/.bashrc theme={"system"} export BASETEN_API_KEY="nPh8..." ``` If you want to run this example in [local debugging mode](/development/chain/localdev#test-a-chain-locally), you'll also need to install chromadb: ```shell theme={"system"} pip install chromadb ``` The complete code used in this tutorial can also be found in the [Chains examples repo](https://github.com/basetenlabs/models/tree/main/truss-chains/examples/rag). # Overview Retrieval-augmented generation (RAG) is a multi-model pipeline for generating context-aware answers from LLMs. There are a number of ways to build a RAG system. This tutorial shows a minimum viable implementation with a basic vector store and retrieval function. It's intended as a starting point to show how Chains helps you flexibly combine model inference and business logic. In this tutorial, we'll build a simple RAG pipeline for a hypothetical alumni matching service for a university. The system: 1. Takes a bio with information about a new graduate 2. Uses a vector database to retrieve semantically similar bios of other alums 3. Uses an LLM to explain why the new graduate should meet the selected alums 4. 
Returns the writeup from the LLM Let's dive in! ## Building the Chain Create a file `rag.py` in a new directory with: ```sh theme={"system"} mkdir rag touch rag/rag.py cd rag ``` Our RAG Chain is composed of three parts: * `VectorStore`, a Chainlet that implements a vector database with a retrieval function. * `LLMClient`, a Stub for connecting to a deployed LLM. * `RAG`, the entrypoint Chainlet that orchestrates the RAG pipeline and has `VectorStore` and `LLMClient` as dependencies. We'll examine these components one by one and then see how they all work together. ### Vector store Chainlet A real production RAG system would use a hosted vector database with a massive number of stored embeddings. For this example, we're using a small local vector store built with `chromadb` to stand in for a more complex system. The Chainlet has three parts: * [`remote_config`](/reference/sdk/chains#remote-configuration), which configures a Docker image on deployment with dependencies. * `__init__()`, which runs once when the Chainlet is spun up, and creates the vector database with ten sample bios. * [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which runs each time the Chainlet is called and is the sole public interface for the Chainlet. ```python rag/rag.py theme={"system"} import truss_chains as chains # Create a Chainlet to serve as our vector database. class VectorStore(chains.ChainletBase): # Add chromadb as a dependency for deployment. remote_config = chains.RemoteConfig( docker_image=chains.DockerImage( pip_requirements=["chromadb"] ) ) # Runs once when the Chainlet is deployed or scaled up. def __init__(self): # Import Chainlet-specific dependencies in init, not at the top of # the file. import chromadb self._chroma_client = chromadb.EphemeralClient() self._collection = self._chroma_client.create_collection(name="bios") # Sample documents are hard-coded for your convenience documents = [ "Angela Martinez is a tech entrepreneur based in San Francisco. As the founder and CEO of a successful AI startup, she is a leading figure in the tech community. Outside of work, Angela enjoys hiking the trails around the Bay Area and volunteering at local animal shelters.", "Ravi Patel resides in New York City, where he works as a financial analyst. Known for his keen insight into market trends, Ravi spends his weekends playing chess in Central Park and exploring the city's diverse culinary scene.", "Sara Kim is a digital marketing specialist living in San Francisco. She helps brands build their online presence with creative strategies. Outside of work, Sara is passionate about photography and enjoys hiking the trails around the Bay Area.", "David O'Connor calls New York City his home and works as a high school teacher. He is dedicated to inspiring the next generation through education. In his free time, David loves running along the Hudson River and participating in local theater productions.", "Lena Rossi is an architect based in San Francisco. She designs sustainable and innovative buildings that contribute to the city's skyline. When she's not working, Lena enjoys practicing yoga and exploring art galleries.", "Akio Tanaka lives in Tokyo and is a software developer specializing in mobile apps. Akio is an avid gamer and enjoys attending eSports tournaments. He also has a passion for cooking and often experiments with new recipes in his spare time.", "Maria Silva is a nurse residing in New York City. She is dedicated to providing compassionate care to her patients. 
Maria finds joy in gardening and often spends her weekends tending to her vibrant flower beds and vegetable garden.", "John Smith is a journalist based in San Francisco. He reports on international politics and has a knack for uncovering compelling stories. Outside of work, John is a history buff who enjoys visiting museums and historical sites.", "Aisha Mohammed lives in Tokyo and works as a graphic designer. She creates visually stunning graphics for a variety of clients. Aisha loves to paint and often showcases her artwork in local exhibitions.", "Carlos Mendes is an environmental engineer in San Francisco. He is passionate about developing sustainable solutions for urban areas. In his leisure time, Carlos enjoys surfing and participating in beach clean-up initiatives." ] # Add all documents to the database self._collection.add( documents=documents, ids=[f"id{n}" for n in range(len(documents))] ) # Runs each time the Chainlet is called async def run_remote(self, query: str) -> list[str]: # This call includes embedding the query string. results = self._collection.query(query_texts=[query], n_results=2) if results is None or not results: raise ValueError("No bios returned from the query") if not results["documents"] or not results["documents"][0]: raise ValueError("Bios are empty") return results["documents"][0] ``` ### LLM inference stub Now that we can retrieve relevant bios from the vector database, we need to pass that information to an LLM to generate our final output. Chains can integrate previously deployed models using a Stub. Like Chainlets, Stubs implement [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), but as a call to the deployed model. For our LLM, we'll use Phi-3 Mini Instruct, a small-but-mighty open source LLM that can be deployed with one click from Baseten's model library. While the model is deploying, be sure to note down the model's invocation URL from the model dashboard for use in the next step. To use our deployed LLM in the RAG Chain, we define a Stub: ```python rag/rag.py theme={"system"} class LLMClient(chains.StubBase): # Runs each time the Stub is called async def run_remote(self, new_bio: str, bios: list[str]) -> str: # Use the retrieved bios to augment the prompt -- here's the "A" in RAG! prompt = f"""You are matching alumni of a college to help them make connections. Explain why the person described first would want to meet the people selected from the matching database. Person you're matching: {new_bio} People from database: {" ".join(bios)}""" # Call the deployed model. resp = await self._remote.predict_async(json_payload={ "messages": [{"role": "user", "content": prompt}], "stream": False }) return resp["output"][len(prompt) :].strip() ``` ### RAG entrypoint Chainlet The entrypoint to a Chain is the Chainlet that specifies the public-facing input and output of the Chain and orchestrates calls to dependencies. The `__init__` function in this Chainlet takes two new arguments: * Add dependencies to any Chainlet with [`chains.depends()`](/reference/sdk/chains#truss-chains-depends). Only Chainlets, not Stubs, need to be added in this fashion. * Use [`chains.depends_context()`](/reference/sdk/chains#truss-chains-depends-context) to inject a context object at runtime. This context object is required to initialize the `LLMClient` stub. * Visit your [Baseten workspace](https://app.baseten.co/models) to find the URL of the previously deployed Phi-3 model and insert it as the value for `LLM_URL`.
```python rag/rag.py theme={"system"} # Insert the URL from the previously deployed Phi-3 model. LLM_URL = ... @chains.mark_entrypoint class RAG(chains.ChainletBase): # Runs once when the Chainlet is spun up def __init__( self, # Declare dependency chainlets. vector_store: VectorStore = chains.depends(VectorStore), context: chains.DeploymentContext = chains.depends_context(), ): self._vector_store = vector_store # The stub needs the context for setting up authentication. self._llm = LLMClient.from_url(LLM_URL, context) # Runs each time the Chain is called async def run_remote(self, new_bio: str) -> str: # Use the VectorStore Chainlet for context retrieval. bios = await self._vector_store.run_remote(new_bio) # Use the LLMClient Stub for augmented generation. contacts = await self._llm.run_remote(new_bio, bios) return contacts ``` ## Testing locally Because our Chain uses a Stub for the LLM call, we can run the whole Chain locally without any GPU resources. Before running the Chainlet, make sure to set your Baseten API key as an environment variable `BASETEN_API_KEY`. ```python rag/rag.py theme={"system"} if __name__ == "__main__": import os import asyncio with chains.run_local( # This secret is needed even locally, because part of this chain # calls the separately deployed Phi-3 model. Only the Chainlets # actually run locally. secrets={"baseten_chain_api_key": os.environ["BASETEN_API_KEY"]} ): rag_client = RAG() result = asyncio.get_event_loop().run_until_complete( rag_client.run_remote( """ Sam just moved to Manhattan for his new job at a large bank. In college, he enjoyed building sets for student plays. """ ) ) print(result) ``` We can run our Chain locally: ```sh theme={"system"} python rag.py ``` After a few moments, we should get a recommendation for why Sam should meet the alumni selected from the database. ## Deploying to production Once we're satisfied with our Chain's local behavior, we can deploy it to production on Baseten. To deploy the Chain, run: ```sh theme={"system"} truss chains push rag.py ``` This will deploy our Chain as a development deployment. Once the Chain is deployed, we can call it from its API endpoint. You can do this in the console with cURL: ```sh theme={"system"} curl -X POST 'https://chain-5wo86nn3.api.baseten.co/development/run_remote' \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"new_bio": "Sam just moved to Manhattan for his new job at a large bank. In college, he enjoyed building sets for student plays."}' ``` Alternatively, you can also integrate this in a Python application: ```python call_chain.py theme={"system"} import requests import os # Insert the URL from the deployed rag chain. You can get it from the CLI # output or the status page, e.g. # "https://chain-6wgeygoq.api.baseten.co/production/run_remote". RAG_CHAIN_URL = "" baseten_api_key = os.environ["BASETEN_API_KEY"] new_bio = "Sam just moved to Manhattan for his new job at a large bank. In college, he enjoyed building sets for student plays." if not RAG_CHAIN_URL: raise ValueError("Please insert the URL for the RAG chain.") resp = requests.post( RAG_CHAIN_URL, headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={"new_bio": new_bio}, ) print(resp.json()) ``` When we're happy with the deployed Chain, we can promote it to production via the UI or by running: ```sh theme={"system"} truss chains push --promote rag.py ``` Once in production, the Chain will have access to full autoscaling settings. Both the development and production deployments will scale to zero when not in use.
--- # Source: https://docs.baseten.co/reference/cli/chains/chains-cli.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Chains CLI reference > Deploy, manage, and develop Chains using the Truss CLI. ```sh theme={"system"} truss chains [OPTIONS] COMMAND [ARGS]... ``` | Command | Description | | ----------------- | -------------------------- | | [`init`](#init) | Initialize a Chain project | | [`push`](#push) | Deploy a Chain | | [`watch`](#watch) | Live reload development | *** ## `init` Initialize a Chain project. ```sh theme={"system"} truss chains init [OPTIONS] [DIRECTORY] ``` * `DIRECTORY` (optional): Path to a new or empty directory for the Chain. Defaults to the current directory if omitted. **Options:** * `--log` `[humanfriendly | INFO | DEBUG]`: Set log verbosity. * `--help`: Show this message and exit. **Example:** To create a new Chain project in a directory called `my-chain`, use the following: ```sh theme={"system"} truss chains init my-chain ``` *** ## `push` Deploy a Chain. ```sh theme={"system"} truss chains push [OPTIONS] SOURCE [ENTRYPOINT] ``` * `SOURCE`: Path to a Python file that contains the entrypoint chainlet. * `ENTRYPOINT` (optional): Class name of the entrypoint chainlet. If omitted, the chainlet tagged with `@chains.mark_entrypoint` is used. **Options:** * `--name` (TEXT): Custom name for the Chain (defaults to entrypoint name). * `--publish / --no-publish`: Create chainlets as a published deployment. * `--promote / --no-promote`: Promote newly deployed chainlets into production. * `--environment` (TEXT): Deploy chainlets into a particular environment. * `--wait / --no-wait`: Wait until all chainlets are ready (or deployment failed). * `--watch / --no-watch`: Watch the Chains source code and apply live patches. Using this option waits for the Chain to be deployed (the `--wait` flag is applied) before starting to watch for changes. This option requires the deployment to be a development deployment. * `--experimental-chainlet-names` (TEXT): Run `watch`, but only apply patches to specified chainlets. The option is a comma-separated list of chainlet (display) names. This option can give faster dev loops, but also lead to inconsistent deployments. Use with caution and refer to [docs](/development/chain/watch). * `--dryrun`: Produce only generated files, but don't deploy anything. * `--remote` (TEXT): Name of the remote in .trussrc to push to. * `--team` (TEXT): Name of the team to deploy to. If not specified, Truss infers the team or prompts for selection. * `--log` `[humanfriendly|I|INFO|D|DEBUG]`: Customize logging. * `--help`: Show this message and exit. The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information. **Example:** To deploy a Chain as a development deployment, use the following: ```sh theme={"system"} truss chains push my_chain.py ``` To deploy and promote to production, use the following: ```sh theme={"system"} truss chains push my_chain.py --publish --promote ``` To deploy to a specific team, use the following: ```sh theme={"system"} truss chains push my_chain.py --team my-team-name ``` *** ## `watch` Live reload development. ```sh theme={"system"} truss chains watch [OPTIONS] SOURCE [ENTRYPOINT] ``` * `SOURCE`: Path to a Python file containing the entrypoint chainlet. 
* `ENTRYPOINT` (optional): Class name of the entrypoint chainlet. If omitted, the chainlet tagged with `@chains.mark_entrypoint` is used. **Options:** * `--name` (TEXT): Name of the Chain to be deployed. If not given, the entrypoint name is used. * `--remote` (TEXT): Name of the remote in .trussrc to push to. * `--team` (TEXT): Name of the team to deploy to. If not specified, Truss infers the team or prompts for selection. The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information. * `--experimental-chainlet-names` (TEXT): Run `watch`, but only apply patches to specified chainlets. The option is a comma-separated list of chainlet (display) names. This option can give faster dev loops, but also lead to inconsistent deployments. Use with caution and refer to [docs](/development/chain/watch). * `--log` `[humanfriendly|W|WARNING|I|INFO|D|DEBUG]`: Customize logging. * `--help`: Show this message and exit. **Example:** To watch a Chain for live reload during development, use the following: ```sh theme={"system"} truss chains watch my_chain.py ``` --- # Source: https://docs.baseten.co/reference/sdk/chains.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Chains SDK Reference > Python SDK Reference for Chains # Chainlet classes APIs for creating user-defined Chainlets. ### *class* `truss_chains.ChainletBase` Base class for all chainlets. Inheriting from this class adds validations to make sure subclasses adhere to the chainlet pattern and facilitates remote chainlet deployment. Refer to [the docs](/development/chain/getting-started) and this [example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py) for more guidance on how to create subclasses. ### *class* `truss_chains.ModelBase` Base class for all standalone models. Inheriting from this class adds validations to make sure subclasses adhere to the truss model pattern. ### *class* `truss_chains.EngineBuilderLLMChainlet` #### *method final async* run\_remote(llm\_input) **Parameters:** | Name | Type | Description | | ----------- | ----------------------- | -------------------------- | | `llm_input` | *EngineBuilderLLMInput* | OpenAI compatible request. | * **Returns:** *AsyncIterator*\[str] ### *function* `truss_chains.depends` Sets a “symbolic marker” to indicate to the framework that a chainlet is a dependency of another chainlet. The return value of `depends` is intended to be used as a default argument in a chainlet’s `__init__`-method. When deploying a chain remotely, a corresponding stub to the remote is injected in its place. In [`run_local`](#function-truss-chains-run-local) mode an instance of a local chainlet is injected. Refer to [the docs](/development/chain/getting-started) and this [example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py) for more guidance on how make one chainlet depend on another chainlet. Despite the type annotation, this does *not* immediately provide a chainlet instance. Only when deploying remotely or using `run_local` a chainlet instance is provided. 
**Parameters:** | Name | Type | Default | Description | | ------------------- | --------------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `chainlet_cls` | *Type\[[ChainletBase](#class-truss-chains-chainletbase)]* | | The chainlet class of the dependency. | | `retries` | *int* | `1` | The number of times to retry the remote chainlet in case of failures (e.g. due to transient network issues). For streaming, retries are only made if the request fails before streaming any results back. Failures mid-stream not retried. | | `timeout_sec` | *float* | `600.0` | Timeout for the HTTP request to this chainlet. | | `use_binary` | *bool* | `False` | Whether to send data in binary format. This can give a parsing speedup and message size reduction (\~25%) for numpy arrays. Use `NumpyArrayField` as a field type on pydantic models for integration and set this option to `True`. For simple text data, there is no significant benefit. | | `concurrency_limit` | *int* | `300` | The maximum number of concurrent requests to send to the remote chainlet. Excessive requests will be queued and a warning will be shown. Try to design your algorithm in a way that spreads requests evenly over time so that this the default value can be used. | * **Returns:** A “symbolic marker” to be used as a default argument in a chainlet’s initializer. ### *function* `truss_chains.depends_context` Sets a “symbolic marker” for injecting a context object at runtime. Refer to [the docs](/development/chain/getting-started) and this [example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py) for more guidance on the `__init__`-signature of chainlets. Despite the type annotation, this does *not* immediately provide a context instance. Only when deploying remotely or using `run_local` a context instance is provided. * **Returns:** A “symbolic marker” to be used as a default argument in a chainlet’s initializer. ### *class* `truss_chains.DeploymentContext` Bases: `pydantic.BaseModel` Bundles config values and resources needed to instantiate Chainlets. The context can optionally be added as a trailing argument in a Chainlet’s `__init__` method and then used to set up the chainlet (e.g. using a secret as an access token for downloading model weights). **Parameters:** | Name | Type | Default | Description | | --------------------- | ------------------------------------------------------------------------------------------ | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `chainlet_to_service` | *Mapping\[str,[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)]* | | A mapping from chainlet names to service descriptors. This is used to create RPC sessions to dependency chainlets. It contains only the chainlet services that are dependencies of the current chainlet. | | `secrets` | *Mapping\[str,str]* | | A mapping from secret names to secret values. It contains only the secrets that are listed in `remote_config.assets.secret_keys` of the current chainlet. 
| | `data_dir` | *Path\|None* | `None` | The directory where the chainlet can store and access data, e.g. for downloading model weights. | | `environment` | *[Environment](#class-truss-chains-environment)\|None* | `None` | The environment that the chainlet is deployed in. None if the chainlet is not associated with an environment. | #### *method* get\_baseten\_api\_key() * **Returns:** str #### *method* get\_service\_descriptor(chainlet\_name) **Parameters:** | Name | Type | Description | | --------------- | ----- | ------------------------- | | `chainlet_name` | *str* | The name of the chainlet. | * **Returns:** [*DeployedServiceDescriptor*](#class-truss-chains-deployedservicedescriptor) ### *class* `truss_chains.Environment` Bases: `pydantic.BaseModel` The environment the chainlet is deployed in. * **Parameters:** **name** (*str*) – The name of the environment. ### *class* `truss_chains.ChainletOptions` Bases: `pydantic.BaseModel` **Parameters:** | Name | Type | Default | Description | | ------------------------ | ----------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `enable_b10_tracing` | *bool* | `False` | enables baseten-internal trace data collection. This helps baseten engineers better analyze chain performance in case of issues. It is independent of a potentially user-configured tracing instrumentation. Turning this on, could add performance overhead. | | `enable_debug_logs` | *bool* | `False` | Sets log level to debug in deployed server. | | `env_variables` | *Mapping\[str,str]* | `{}` | static environment variables available to the deployed chainlet. | | `health_checks` | *HealthChecks* | `truss.base.truss_config.HealthChecks()` | Configures health checks for the chainlet. See [guide](https://docs.baseten.co/truss/guides/custom-health-checks#chains). | | `metadata` | *JsonValue\|None* | `None` | Arbitrary JSON object to describe chainlet. | | `streaming_read_timeout` | *int* | `60` | Amount of time (in seconds) between each streamed chunk before a timeout is triggered. | | `transport` | *Union\[HTTPOptions\|WebsocketOptions\|GRPCOptions]'* | `None` | Allows to customize certain transport protocols, e.g. websocket pings. | ### *class* `truss_chains.RPCOptions` Bases: `pydantic.BaseModel` Options to customize RPCs to dependency chainlets. **Parameters:** | Name | Type | Default | Description | | ------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `retries` | *int* | `1` | The number of times to retry the remote chainlet in case of failures (e.g. due to transient network issues). For streaming, retries are only made if the request fails before streaming any results back. Failures mid-stream not retried. | | `timeout_sec` | *float* | `600.0` | Timeout for the HTTP request to this chainlet. | | `use_binary` | *bool* | `False` | Whether to send data in binary format. This can give a parsing speedup and message size reduction (\~25%) for numpy arrays. 
Use `NumpyArrayField` as a field type on pydantic models for integration and set this option to `True`. For simple text data, there is no significant benefit. | | `concurrency_limit` | *int* | `300` | The maximum number of concurrent requests to send to the remote chainlet. Excessive requests will be queued and a warning will be shown. Try to design your algorithm in a way that spreads requests evenly over time so that the default value can be used. | ### *function* `truss_chains.mark_entrypoint` Decorator to mark a chainlet as the entrypoint of a chain. This decorator can be applied to *one* chainlet in a source file and then the CLI push command simplifies: only the file, not the class within, must be specified. Optionally a display name for the Chain (not the Chainlet) can be set (effectively giving a custom default value for the `name` arg of the CLI push command). Example usage: ```python theme={"system"} import truss_chains as chains @chains.mark_entrypoint class MyChainlet(ChainletBase): ... # OR with custom Chain name. @chains.mark_entrypoint("My Chain Name") class MyChainlet(ChainletBase): ... ``` # Remote Configuration These data structures specify for each chainlet how it gets deployed remotely, e.g. dependencies and compute resources. ### *class* `truss_chains.RemoteConfig` Bases: `pydantic.BaseModel` Bundles config values needed to deploy a chainlet remotely. This is specified as a class variable for each chainlet class, e.g.: ```python theme={"system"} import truss_chains as chains class MyChainlet(chains.ChainletBase): remote_config = chains.RemoteConfig( docker_image=chains.DockerImage( pip_requirements=["torch==2.0.1", ...] ), compute=chains.Compute(cpu_count=2, gpu="A10G", ...), assets=chains.Assets(secret_keys=["hf_access_token"], ...), ) ``` **Parameters:** | Name | Type | Default | | -------------- | -------------------------------------------------------- | -------------------------------- | | `docker_image` | *[DockerImage](#class-truss-chains-dockerimage)* | `truss_chains.DockerImage()` | | `compute` | *[Compute](#class-truss-chains-compute)* | `truss_chains.Compute()` | | `assets` | *[Assets](#class-truss-chains-assets)* | `truss_chains.Assets()` | | `name` | *str\|None* | `None` | | `options` | *[ChainletOptions](#class-truss-chains-chainletoptions)* | `truss_chains.ChainletOptions()` | ### *class* `truss_chains.DockerImage` Bases: `pydantic.BaseModel` Configures the docker image in which a remote chainlet is deployed. Any paths are relative to the source file where `DockerImage` is defined and must be created with the helper function [`make_abs_path_here`](#function-truss-chains-make-abs-path-here). This allows you, for example, to organize chainlets in different (potentially nested) modules and keep their requirement files right next to their python source files.
**Parameters:** | Name | Type | Default | Description | | ------------------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `base_image` | *[BasetenImage](#class-truss-chains-basetenimage)\|[CustomImage](#class-truss-chains-customimage)* | `truss_chains.BasetenImage()` | The base image used by the chainlet. Other dependencies and assets are included as additional layers on top of that image. You can choose a Baseten default image for a supported python version (e.g. `BasetenImage.PY311`), this will also include GPU drivers if needed, or provide a custom image (e.g. `CustomImage(image="python:3.11-slim")`). | | `pip_requirements_file` | *AbsPath\|None* | `None` | Path to a file containing pip requirements. The file content is naively concatenated with `pip_requirements`. | | `pip_requirements` | *list\[str]* | `[]` | A list of pip requirements to install. The items are naively concatenated with the content of the `pip_requirements_file`. | | `apt_requirements` | *list\[str]* | `[]` | A list of apt requirements to install. | | `data_dir` | *AbsPath\|None* | `None` | Data from this directory is copied into the docker image and accessible to the remote chainlet at runtime. | | `external_package_dirs` | *list\[AbsPath]\|None* | `None` | A list of directories containing additional python packages outside the chain’s workspace dir, e.g. a shared library. This code is copied into the docker image and importable at runtime. | | `truss_server_version_override` | *str\|None* | `None` | By default, deployed Chainlets use the truss server implementation corresponding to the truss version of the user’s CLI. To use a specific version, e.g. pinning it for exact reproducibility, the version can be overridden here. Valid versions correspond to truss releases on PyPi: [https://pypi.org/project/truss/#history](https://pypi.org/project/truss/#history), e.g. “0.9.80”. | ### *class* `truss_chains.BasetenImage` Bases: `Enum` Default images, curated by baseten, for different python versions. If a Chainlet uses GPUs, drivers will be included in the image. | Enum Member | Value | | ----------- | ------- | | `PY39` | *py39* | | `PY310` | *py310* | | `PY311` | *py311* | | `PY312` | *py312* | | `PY313` | *py313* | | `PY314` | *py314* | ### *class* `truss_chains.CustomImage` Bases: `pydantic.BaseModel` Configures the usage of a custom image hosted on dockerhub. **Parameters:** | Name | Type | Default | Description | | ------------------------ | -------------------------- | ------- | ------------------------------------------------------------------------------------------------------ | | `image` | *str* | | Reference to image on dockerhub. | | `python_executable_path` | *str\|None* | `None` | Absolute path to python executable (if default `python` is ambiguous). | | `docker_auth` | *DockerAuthSettings\|None* | `None` | See [corresponding truss config](/development/model/base-images#example%3A-docker-hub-authentication). | ### *class* `truss_chains.Compute` Specifies which compute resources a chainlet has in the *remote* deployment. 
Not all combinations can be exactly satisfied by available hardware, in some cases more powerful machine types are chosen to make sure requirements are met or over-provisioned. Refer to the [baseten instance reference](https://docs.baseten.co/deployment/resources). **Parameters:** | Name | Type | Default | Description | | --------------------- | ----------------------------- | ------- | --------------------------------------------------------------------------------------------------------------- | | `cpu_count` | *int* | `1` | Minimum number of CPUs to allocate. | | `memory` | *str* | `'2Gi'` | Minimum memory to allocate, e.g. “2Gi” (2 gibibytes). | | `gpu` | *str\|Accelerator\|None* | `None` | GPU accelerator type, e.g. “A10G”, “A100”, refer to the [truss config](/deployment/resources) for more choices. | | `gpu_count` | *int* | `1` | Number of GPUs to allocate. | | `predict_concurrency` | *int\|Literal\['cpu\_count']* | `1` | Number of concurrent requests a single replica of a deployed chainlet handles. | Concurrency concepts are explained in [this guide](/development/model/concurrency#2-predict-concurrency). It is important to understand the difference between predict\_concurrency and the concurrency target (used for autoscaling, i.e. adding or removing replicas). Furthermore, the `predict_concurrency` of a single instance is implemented in two ways: * Via python’s `asyncio`, if `run_remote` is an async def. This requires that `run_remote` yields to the event loop. * With a threadpool if it’s a synchronous function. This requires that the threads don’t have significant CPU load (due to the GIL). ### *class* `truss_chains.Assets` Specifies which assets a chainlet can access in the remote deployment. For example, model weight caching can be used like this: ```python theme={"system"} import truss_chains as chains from truss.base import truss_config mistral_cache = truss_config.ModelRepo( repo_id="mistralai/Mistral-7B-Instruct-v0.2", allow_patterns=["*.json", "*.safetensors", ".model"] ) chains.Assets(cached=[mistral_cache], ...) ``` **Parameters:** | Name | Type | Default | Description | | --------------- | ----------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `cached` | *Iterable\[ModelRepo]* | `()` | One or more `truss_config.ModelRepo` objects. | | `secret_keys` | *Iterable\[str]* | `()` | Names of secrets stored on baseten, that the chainlet should have access to. You can manage secrets on baseten [here](https://app.baseten.co/settings/secrets). | | `external_data` | *Iterable\[ExternalDataItem]* | `()` | Data to be downloaded from public URLs and made available in the deployment (via `context.data_dir`). | # Core General framework and helper functions. ### *function* `truss_chains.push` Deploys a chain remotely (with all dependent chainlets). **Parameters:** | Name | Type | Default | Description | | ----------------------- | -------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `entrypoint` | *Type\[ChainletT]* | | The chainlet class that serves as the entrypoint to the chain. | | `chain_name` | *str* | | The name of the chain. 
| | `publish` | *bool* | `True` | Whether to publish the chain as a published deployment (it is a draft deployment otherwise). | | `promote` | *bool* | `True` | Whether to promote the chain to be the production deployment (this implies publishing as well). | | `only_generate_trusses` | *bool* | `False` | Used for debugging purposes. If set to True, only the underlying truss models for the chainlets are generated in `/tmp/.chains_generated`. | | `remote` | *str* | `'baseten'` | Name of a remote config in .trussrc. If not provided, you will be prompted for it. | | `environment` | *str\|None* | `None` | The name of an environment to promote deployment into. | | `progress_bar` | *Type\[progress.Progress]\|None* | `None` | Optional rich.progress.Progress if output is desired. | | `include_git_info` | *bool* | `False` | Whether to attach git versioning info (sha, branch, tag) to deployments made from within a git repo. If set to True in .trussrc, it will always be attached. | * **Returns:** [*ChainService*](#class-truss-chains-remote-chainservice): A chain service handle to the deployed chain. ### *class* `truss_chains.deployment.deployment_client.ChainService` Handle for a deployed chain. A `ChainService` is created and returned when using `push`. It bundles the individual services for each chainlet in the chain, and provides utilities to query their status, invoke the entrypoint etc. #### *method* get\_info() Queries the statuses of all chainlets in the chain. * **Returns:** List of `DeployedChainlet`, `(name, is_entrypoint, status, logs_url)` for each chainlet. #### *property* name *: str* #### *method* run\_remote(json) Invokes the entrypoint with JSON data. **Parameters:** | Name | Type | Description | | ------ | ----------- | ---------------------------- | | `json` | *JSON dict* | Input data to the entrypoint | * **Returns:** The JSON response. #### *property* run\_remote\_url *: str* URL to invoke the entrypoint. #### *property* status\_page\_url *: str* Link to status page on Baseten. ### *function* `truss_chains.make_abs_path_here` Helper to specify file paths relative to the *immediately calling* module. E.g. if you have a project structure like this: ```default theme={"system"} root/ chain.py common_requirements.txt sub_package/ chainlet.py chainlet_requirements.txt ``` You can then, in `root/sub_package/chainlet.py`, point to the requirements files like this: ```python theme={"system"} shared = make_abs_path_here("../common_requirements.txt") specific = make_abs_path_here("chainlet_requirements.txt") ``` This helper uses the directory of the immediately calling module as an absolute reference point for resolving the file location. Therefore, you MUST NOT wrap the instantiation of `make_abs_path_here` into a function (e.g. applying decorators) or use dynamic code execution. Ok: ```python theme={"system"} def foo(path: AbsPath): abs_path = path.abs_path foo(make_abs_path_here("./somewhere")) ``` Not Ok: ```python theme={"system"} def foo(path: str): dangerous_value = make_abs_path_here(path).abs_path foo("./somewhere") ``` **Parameters:** | Name | Type | Description | | ----------- | ----- | -------------------------- | | `file_path` | *str* | Absolute or relative path. | * **Returns:** *AbsPath* ### *function* `truss_chains.run_local` Context manager for local debug execution of a chain. The arguments only need to be provided if the chainlets explicitly access any of the corresponding fields of [`DeploymentContext`](#class-truss-chains-deploymentcontext).
**Parameters:**

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `secrets` | *Mapping\[str,str]\|None* | `None` | A dict of secret keys and values to provide to the chainlets. |
| `data_dir` | *Path\|str\|None* | `None` | Path to a directory with data files. |
| `chainlet_to_service` | *Mapping\[str,[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)]* | `None` | A dict of chainlet names to service descriptors. |

Example usage (as trailing main section in a chain file):

```python theme={"system"}
import os
import truss_chains as chains


class HelloWorld(chains.ChainletBase):
    ...


if __name__ == "__main__":
    with chains.run_local(
        secrets={"some_token": os.environ["SOME_TOKEN"]},
        chainlet_to_service={
            "SomeChainlet": chains.DeployedServiceDescriptor(
                name="SomeChainlet",
                display_name="SomeChainlet",
                predict_url="https://...",
                options=chains.RPCOptions(),
            )
        },
    ):
        hello_world_chain = HelloWorld()
        result = hello_world_chain.run_remote(max_value=5)
        print(result)
```

Refer to the [local debugging guide](/development/chain/localdev) for more details.

### *class* `truss_chains.DeployedServiceDescriptor`

Bases: `pydantic.BaseModel`

Bundles values to establish an RPC session to a dependency chainlet, specifically with `StubBase`.

**Parameters:**

| Name | Type | Default |
| --- | --- | --- |
| `name` | *str* | |
| `display_name` | *str* | |
| `options` | *[RPCOptions](#class-truss-chains-rpcoptions)* | |
| `predict_url` | *str\|None* | `None` |
| `internal_url` | *InternalURL* | `None` |

### *class* `truss_chains.StubBase`

Bases: `BasetenSession`, `ABC`

Base class for stubs that invoke remote chainlets. Extends `BasetenSession` with methods for data serialization, de-serialization and invoking other endpoints. It is used internally for RPCs to dependency chainlets, but it can also be used in user code for wrapping a deployed truss model into the Chains framework. It flexibly supports JSON and pydantic inputs and outputs.

Example usage:

```python theme={"system"}
import pydantic
import truss_chains as chains


class WhisperOutput(pydantic.BaseModel):
    ...


class DeployedWhisper(chains.StubBase):
    # Input JSON, output JSON.
    async def run_remote(self, audio_b64: str) -> Any:
        return await self.predict_async(inputs={"audio": audio_b64})
        # resp == {"text": ..., "language": ...}

    # OR Input JSON, output pydantic model.
    async def run_remote(self, audio_b64: str) -> WhisperOutput:
        return await self.predict_async(
            inputs={"audio": audio_b64}, output_model=WhisperOutput)

    # OR Input and output are pydantic models.
    async def run_remote(self, data: WhisperInput) -> WhisperOutput:
        return await self.predict_async(data, output_model=WhisperOutput)


class MyChainlet(chains.ChainletBase):
    def __init__(self, ..., context=chains.depends_context()):
        ...
        self._whisper = DeployedWhisper.from_url(
            WHISPER_URL,
            context,
            options=chains.RPCOptions(retries=3),
        )

    async def run_remote(self, ...):
        await self._whisper.run_remote(...)
```

**Parameters:**

| Name | Type | Description |
| --- | --- | --- |
| `service_descriptor` | *[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)* | Contains the URL and other configuration. |
| `api_key` | *str* | A Baseten API key to authorize requests. |

#### *classmethod* from\_url(predict\_url, context\_or\_api\_key, options=None)

Factory method, convenient to use in a chainlet's `__init__` method.

**Parameters:**

| Name | Type | Description |
| --- | --- | --- |
| `predict_url` | *str* | URL to predict endpoint of another chain / truss model. |
| `context_or_api_key` | *[DeploymentContext](#class-truss-chains-deploymentcontext)* | Deployment context object (obtained in the chainlet's `__init__`) or a Baseten API key. |
| `options` | *[RPCOptions](#class-truss-chains-rpcoptions)* | RPC options, e.g. retries. |

#### Invocation Methods

* `async predict_async(inputs: PydanticModel, output_model: Type[PydanticModel]) → PydanticModel`
* `async predict_async(inputs: JSON, output_model: Type[PydanticModel]) → PydanticModel`
* `async predict_async(inputs: JSON) → JSON`
* `async predict_async_stream(inputs: PydanticModel | JSON) → AsyncIterator[bytes]`

Deprecated synchronous methods:

* `predict_sync(inputs: PydanticModel, output_model: Type[PydanticModel]) → PydanticModel`
* `predict_sync(inputs: JSON, output_model: Type[PydanticModel]) → PydanticModel`
* `predict_sync(inputs: JSON) → JSON`

### *class* `truss_chains.RemoteErrorDetail`

Bases: `pydantic.BaseModel`

When a remote chainlet raises an exception, this pydantic model contains information about the error and stack trace and is included in JSON form in the error response.

**Parameters:**

| Name | Type |
| --- | --- |
| `exception_cls_name` | *str* |
| `exception_module_name` | *str\|None* |
| `exception_message` | *str* |
| `user_stack_trace` | *list\[StackFrame]* |

#### *method* format()

Format the error for printing, similar to how Python formats exceptions with stack traces.

* **Returns:** str

### *class* `truss_chains.GenericRemoteException`

Bases: `Exception`

Raised when calling a remote chainlet results in an error and it is not possible to re-raise the same exception that was raised remotely in the caller.

---

# Source: https://docs.baseten.co/reference/inference-api/chat-completions.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Chat Completions

> Creates a chat completion for the provided conversation. This endpoint is fully compatible with the OpenAI Chat Completions API, allowing you to use standard OpenAI SDKs by changing only the base URL and API key.

Download the [OpenAPI schema](/reference/inference-api/llm-openapi-spec.json) for code generation and client libraries.

[Model APIs](https://app.baseten.co/model-apis/create) provide instant access to high-performance open-source LLMs through an OpenAI-compatible endpoint.

## Replace OpenAI with Baseten

Switching from OpenAI to Baseten takes two changes: the base URL and API key.
To switch to Baseten with the Python SDK, change `base_url` and `api_key` when initializing the client: ```python theme={"system"} from openai import OpenAI import os client = OpenAI( base_url="https://inference.baseten.co/v1", api_key=os.environ["BASETEN_API_KEY"], ) response = client.chat.completions.create( model="deepseek-ai/DeepSeek-V3.1", messages=[{"role": "user", "content": "Hello!"}], ) ``` To switch to Baseten with the JavaScript SDK, change `baseURL` and `apiKey` when initializing the client: ```javascript theme={"system"} import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://inference.baseten.co/v1", apiKey: process.env.BASETEN_API_KEY, }); const response = await client.chat.completions.create({ model: "deepseek-ai/DeepSeek-V3.1", messages: [{ role: "user", content: "Hello!" }], }); ``` To call Baseten with cURL, send a POST request to `inference.baseten.co` with your API key: ```bash theme={"system"} curl https://inference.baseten.co/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{ "model": "deepseek-ai/DeepSeek-V3.1", "messages": [{"role": "user", "content": "Hello!"}] }' ``` Deploy a [Model API](https://app.baseten.co/model-apis/create) to get started. For detailed usage guides including structured outputs and tool calling, see [Using Model APIs](/development/model-apis/overview). ## OpenAPI ````yaml reference/inference-api/llm-openapi-spec.json post /v1/chat/completions openapi: 3.1.0 info: title: Baseten LLM Inference API version: 1.0.0 description: >- OpenAI-compatible API for Baseten Model APIs. Use this endpoint to interact with hosted LLMs. servers: - url: https://inference.baseten.co description: Baseten Inference API. security: - ApiKeyAuth: [] paths: /v1/chat/completions: post: tags: - Chat Completions summary: Create a chat completion description: >- Creates a chat completion for the provided conversation. This endpoint is fully compatible with the OpenAI Chat Completions API, allowing you to use standard OpenAI SDKs by changing only the base URL and API key. operationId: createChatCompletion requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/ChatCompletionRequest' responses: '200': description: Successful response content: application/json: schema: $ref: '#/components/schemas/ChatCompletionResponse' '400': description: 'Bad request: invalid parameters.' '401': description: 'Unauthorized: invalid or missing API key.' '429': description: Rate limit exceeded. '500': description: Internal server error. components: schemas: ChatCompletionRequest: additionalProperties: false properties: messages: type: array items: $ref: '#/components/schemas/ChatCompletionMessage' description: >- A list of messages representing the conversation history. Supports roles: `system`, `user`, `assistant`, and `tool`. model: title: Model type: string description: >- The model slug to use for completion, such as `deepseek-ai/DeepSeek-V3.1`. Find available models at [Model APIs](https://app.baseten.co/model-apis/create). frequency_penalty: default: 0 title: Frequency Penalty description: >- Penalizes tokens based on how frequently they appear in the text so far. Positive values decrease repetition. Support varies by model. type: number logit_bias: default: null title: Logit Bias description: >- A map of token IDs to bias values (-100 to 100). Use this to increase or decrease the likelihood of specific tokens appearing in the output. 
additionalProperties: type: number type: object logprobs: default: false title: Logprobs description: >- If `true`, returns log probabilities of the output tokens. Log probability support varies by model. type: boolean top_logprobs: default: 0 title: Top Logprobs description: >- Number of most likely tokens to return at each position (0-20). Requires `logprobs: true`. Log probability support varies by model. type: integer max_tokens: default: 4096 maximum: 262144 minimum: 1 title: Max Tokens type: integer description: >- Maximum number of tokens to generate. If your request input plus `max_tokens` exceeds the model's context length, `max_tokens` is truncated. If your request exceeds the context length by more than 16k tokens or if `max_tokens` signals no preference, context reservation is throttled to 49512 tokens. Higher `max_tokens` values slightly deprioritize request scheduling. 'n': default: 1 title: 'N' description: Number of completions to generate. Only `1` is supported. type: integer presence_penalty: default: 0 title: Presence Penalty description: >- Penalizes tokens based on whether they have appeared in the text so far. Positive values encourage the model to discuss new topics. Support varies by model. type: number response_format: anyOf: - $ref: '#/components/schemas/ResponseFormatText' - $ref: '#/components/schemas/ResponseFormatJson' - $ref: '#/components/schemas/ResponseFormatJsonObject' - $ref: '#/components/schemas/ResponseFormatGrammar' - $ref: '#/components/schemas/ResponseFormatStructuralTag' default: null title: Response Format description: >- Specifies the output format. Use `{"type": "json_object"}` for JSON mode, or `{"type": "json_schema", "json_schema": {...}}` for structured outputs with a specific schema. seed: default: null title: Seed description: >- Random seed for deterministic generation. Determinism is not guaranteed across different hardware or model versions. type: integer stop: anyOf: - maxLength: 1000 minLength: 1 type: string - items: maxLength: 1000 minLength: 1 type: string maxItems: 32 type: array title: Stop description: >- Up to 32 sequences where the API stops generating further tokens. Can be a string or array of strings. stream: default: false title: Stream description: >- If `true`, responses are streamed back as server-sent events (SSE) as they are generated. type: boolean stream_options: $ref: '#/components/schemas/StreamOptions' default: null description: >- Options for streaming responses. Set `include_usage: true` to receive token usage statistics in the final chunk. temperature: default: null title: Temperature description: >- Controls randomness in the output. Lower values like 0.2 produce more focused and deterministic responses. Higher values like 1.5 produce more creative and varied output. maximum: 4 minimum: 0 type: number top_p: default: 1 title: Top P description: >- Nucleus sampling: only consider tokens with cumulative probability up to this value. Lower values like 0.1 produce more focused output. exclusiveMinimum: 0 maximum: 1 type: number tools: default: null title: Tools description: >- A list of tools (functions) the model may call. Each tool should have a `type: "function"` and a `function` object with `name`, `description`, and `parameters`. 
items: $ref: '#/components/schemas/ChatCompletionToolsParam' type: array tool_choice: anyOf: - enum: - none - required - auto type: string - $ref: '#/components/schemas/ChatCompletionNamedToolChoiceParam' default: null title: Tool Choice description: >- Controls which tool (if any) the model calls. - `none`: Never call a tool. - `auto`: Model decides whether to call a tool. - `required`: Model must call at least one tool. - `{"type": "function", "function": {"name": "..."}}`: Call a specific function. parallel_tool_calls: default: true title: Parallel Tool Calls description: If `true`, the model can call multiple tools in a single response. type: boolean user: default: null title: User description: >- A unique identifier for the end-user, useful for tracking and abuse detection. type: string best_of: default: null title: Best Of description: >- Number of candidate sequences to generate and return the best from. Only a value of 1 is supported. maximum: 1 minimum: 1 type: integer top_k: default: 50 title: Top K description: >- Limits token selection to the top K most probable tokens at each step. Lower values like 10 produce more focused output. Set to -1 to disable. type: integer top_p_min: default: 0 title: Top P Min type: number description: >- Minimum value for dynamic `top_p`. When set, `top_p` dynamically adjusts but does not go below this value. min_p: default: 0 title: Min P type: number description: >- Minimum probability threshold for token selection. Filters out tokens with probability below `min_p * max_probability`. repetition_penalty: default: 1 title: Repetition Penalty type: number description: >- Multiplicative penalty for repeated tokens. Values greater than 1.0 discourage repetition, values less than 1.0 encourage it. length_penalty: default: 1 title: Length Penalty type: number description: >- Exponential penalty applied to sequence length during beam search. Values greater than 1.0 favor longer sequences. early_stopping: default: false title: Early Stopping type: boolean description: >- If `true`, stops generation when at least `n` complete candidates are found. bad: anyOf: - type: string - items: type: string type: array title: Bad description: Words or phrases to avoid in the output. Support varies by model. bad_token_ids: title: Bad Token Ids description: Token IDs to avoid in the output. Support varies by model. items: type: integer type: array stop_token_ids: title: Stop Token Ids description: List of token IDs that cause generation to stop when encountered. items: type: integer type: array include_stop_str_in_output: default: false title: Include Stop Str In Output type: boolean description: If `true`, includes the matched stop string in the output. ignore_eos: default: false title: Ignore Eos type: boolean description: If `true`, continues generating past the end-of-sequence token. min_tokens: default: 0 title: Min Tokens type: integer description: >- Minimum number of tokens to generate before stopping. Useful for ensuring responses are not too short. skip_special_tokens: default: true title: Skip Special Tokens type: boolean description: If `true`, removes special tokens from the generated output. spaces_between_special_tokens: default: true title: Spaces Between Special Tokens type: boolean description: If `true`, adds spaces between special tokens in the output. truncate_prompt_tokens: default: null title: Truncate Prompt Tokens description: >- If set, truncates the prompt to this many tokens. Useful for handling inputs that may exceed context limits. 
minimum: 1 type: integer echo: default: false description: >- If `true` and the last message role matches the generation role, prepends that message to the output. title: Echo type: boolean add_generation_prompt: default: true description: >- If `true`, adds the generation prompt from the chat template, such as `<|assistant|>`. Set to `false` for completion-style generation. title: Add Generation Prompt type: boolean add_special_tokens: default: false description: >- If `true`, adds special tokens like BOS to the prompt beyond what the chat template adds. For most models, the chat template handles special tokens, so this should be `false`. title: Add Special Tokens type: boolean documents: default: null description: >- A list of documents for RAG (retrieval-augmented generation). Each document is a dict with string keys and values that the model can reference. title: Documents items: additionalProperties: type: string type: object type: array chat_template: default: null description: >- A custom Jinja template for formatting the conversation. If not provided, uses the model's default template. title: Chat Template type: string chat_template_args: default: null description: Additional arguments to pass to the chat template renderer. title: Chat Template Args additionalProperties: true type: object disaggregated_params: $ref: '#/components/schemas/DisaggregatedParams' default: null description: >- Advanced parameters for disaggregated serving. Used internally for distributed inference. required: - messages - model title: ChatCompletionRequest type: object description: Request body for creating a chat completion. ChatCompletionResponse: additionalProperties: false properties: id: title: Id type: string description: A unique identifier for the chat completion. object: const: chat.completion.chunk default: chat.completion.chunk title: Object type: string description: >- The object type, always `chat.completion` or `chat.completion.chunk` for streaming. created: title: Created type: integer description: The Unix timestamp (in seconds) of when the completion was created. model: title: Model type: string description: The model used for the completion. choices: items: $ref: '#/components/schemas/ChatCompletionResponseStreamChoice' title: Choices type: array description: A list of chat completion choices. usage: $ref: '#/components/schemas/UsageInfo' default: null description: >- Token usage statistics for the request. Only present when streaming with `stream_options.include_usage: true`. required: - model - choices title: ChatCompletionStreamResponse type: object description: A chat completion response returned by the model. ChatCompletionMessage: type: object required: - role description: >- A message in the conversation. Supports roles: `system`, `user`, `assistant`, and `tool`. properties: role: type: string enum: - system - user - assistant - tool description: >- The role of the message author: `system` (instructions), `user` (input), `assistant` (model response), or `tool` (tool result). content: anyOf: - type: string - type: array items: anyOf: - $ref: '#/components/schemas/ChatCompletionContentPartTextParam' - $ref: '#/components/schemas/ChatCompletionContentPartImageParam' - $ref: >- #/components/schemas/ChatCompletionContentPartInputAudioParam description: >- The message content. Can be a string or an array of content parts (text, image, audio) for multimodal inputs. name: type: string description: >- An optional name for the participant. 
Useful for distinguishing between multiple users or assistants. tool_calls: type: array items: $ref: '#/components/schemas/ChatCompletionMessageToolCallParam' description: Tool calls generated by the model (for assistant messages). tool_call_id: type: string description: >- The ID of the tool call this message responds to (required for tool messages). ResponseFormatText: additionalProperties: false properties: type: const: text title: Type type: string description: The response format type, always `text`. required: - type title: ResponseFormatText type: object description: Plain text response format. ResponseFormatJson: additionalProperties: false properties: type: const: json_schema title: Type type: string description: The response format type, always `json_schema`. json_schema: $ref: '#/components/schemas/JsonSchema' description: The JSON schema definition. required: - type - json_schema title: ResponseFormatJson type: object description: JSON schema response format for structured outputs. ResponseFormatJsonObject: additionalProperties: false properties: type: const: json_object title: Type type: string description: The response format type, always `json_object`. required: - type title: ResponseFormatJsonObject type: object description: JSON object response format. ResponseFormatGrammar: additionalProperties: false properties: type: const: grammar title: Type type: string description: The response format type, always `grammar`. grammar: title: Grammar type: string description: The grammar definition string. required: - type - grammar title: ResponseFormatGrammar type: object description: Grammar-based response format. ResponseFormatStructuralTag: additionalProperties: false properties: type: const: structural_tag title: Type type: string description: The response format type, always `structural_tag`. structural_tag: title: Structural Tag type: string description: The structural tag definition. required: - type - structural_tag title: ResponseFormatStructuralTag type: object description: Structural tag response format. StreamOptions: additionalProperties: false properties: include_usage: default: true title: Include Usage description: >- If `true`, includes token usage statistics in the final streaming chunk. type: boolean continuous_usage_stats: default: true title: Continuous Usage Stats description: >- If `true`, includes running token usage statistics in each streaming chunk. type: boolean title: StreamOptions type: object description: Options for streaming responses. ChatCompletionToolsParam: additionalProperties: false properties: type: const: function default: function title: Type type: string description: The type of tool, always `function`. function: $ref: '#/components/schemas/FunctionDefinition' description: The function definition. required: - function title: ChatCompletionToolsParam type: object description: A tool that the model can call. ChatCompletionNamedToolChoiceParam: additionalProperties: false properties: function: $ref: '#/components/schemas/ChatCompletionNamedFunction' description: The function to call. type: const: function default: function title: Type type: string description: The type, always `function`. required: - function title: ChatCompletionNamedToolChoiceParam type: object description: Forces the model to call a specific function. DisaggregatedParams: additionalProperties: false properties: request_type: title: Request Type type: string description: The type of disaggregated request. 
first_gen_tokens: default: null title: First Gen Tokens description: First generation tokens for continuation. items: type: integer type: array ctx_request_id: default: null title: Ctx Request Id description: Context request identifier. type: integer opaque_state: default: null title: Opaque State description: Opaque state for continuation. type: string draft_tokens: default: null title: Draft Tokens description: Draft tokens for speculative decoding. items: type: integer type: array multimodal_embedding_handles: default: null title: Multimodal Embedding Handles description: Handles for multimodal embeddings. items: additionalProperties: true type: object type: array multimodal_hashes: default: null title: Multimodal Hashes description: Hashes for multimodal content. items: items: type: integer type: array type: array required: - request_type title: DisaggregatedParams type: object description: Advanced parameters for disaggregated serving. Used internally. ChatCompletionResponseStreamChoice: additionalProperties: false properties: index: title: Index type: integer description: The index of this choice in the list of choices. delta: $ref: '#/components/schemas/DeltaMessage' description: The delta content for streaming responses. logprobs: $ref: '#/components/schemas/ChatCompletionLogProbs' default: null description: Log probability information for the choice. finish_reason: default: null title: Finish Reason description: >- The reason the model stopped generating: `stop` (natural stop or stop sequence), `length` (max tokens reached), or `tool_calls` (model called a tool). type: string stop_reason: anyOf: - type: integer - type: string default: null title: Stop Reason description: >- The specific stop sequence or token ID that caused generation to stop. required: - index - delta title: ChatCompletionResponseStreamChoice type: object description: A choice in the chat completion response. UsageInfo: additionalProperties: false properties: completion_tokens: default: 0 title: Completion Tokens type: integer description: Number of tokens in the generated completion. prompt_tokens: default: 0 title: Prompt Tokens type: integer description: Number of tokens in the prompt. total_tokens: default: 0 title: Total Tokens type: integer description: Total number of tokens used (prompt + completion). completion_tokens_details: $ref: '#/components/schemas/CompletionTokensDetails' description: Breakdown of completion token usage. prompt_tokens_details: $ref: '#/components/schemas/PromptTokensDetails' description: Breakdown of prompt token usage. title: UsageInfo type: object description: Token usage statistics for the request. ChatCompletionContentPartTextParam: additionalProperties: false properties: type: const: text title: Type type: string description: The content type, always `text`. text: title: Text type: string description: The text content. required: - type - text title: ChatCompletionContentPartTextParam type: object description: Text content part. ChatCompletionContentPartImageParam: additionalProperties: false properties: type: const: image_url title: Type type: string description: The content type, always `image_url`. image_url: $ref: '#/components/schemas/ImageURL' description: The image URL and detail settings. required: - type - image_url title: ChatCompletionContentPartImageParam type: object description: Image content part for vision models. 
ChatCompletionContentPartInputAudioParam: additionalProperties: false properties: type: const: input_audio title: Type type: string description: The content type, always `input_audio`. input_audio: $ref: '#/components/schemas/InputAudio' description: The audio data and format. required: - type - input_audio title: ChatCompletionContentPartInputAudioParam type: object description: Audio content part for audio-capable models. ChatCompletionMessageToolCallParam: additionalProperties: false properties: id: title: Id type: string description: The ID of the tool call. index: title: Index description: The index of the tool call. type: integer function: $ref: '#/components/schemas/Function' description: The function that was called. type: const: function title: Type type: string description: The type, always `function`. required: - id - function - type title: ChatCompletionMessageToolCallParam type: object description: A tool call in an assistant message. JsonSchema: additionalProperties: false properties: name: title: Name type: string description: The name of the schema. description: default: null title: Description description: A description of the schema. type: string schema: additionalProperties: true title: Schema type: object description: The JSON Schema definition. strict: default: true title: Strict description: If `true`, enables strict schema adherence. const: true type: boolean required: - name - schema title: JsonSchema type: object description: A JSON schema for structured output. FunctionDefinition: additionalProperties: false properties: name: title: Name type: string description: The name of the function. description: default: null title: Description description: A description of what the function does. type: string parameters: default: null title: Parameters description: The parameters the function accepts, as a JSON Schema object. additionalProperties: true type: object strict: default: false title: Strict description: If `true`, enables strict schema adherence. type: boolean required: - name title: FunctionDefinition type: object description: A function definition that the model can call. ChatCompletionNamedFunction: additionalProperties: false properties: name: title: Name type: string description: The name of the function to call. required: - name title: ChatCompletionNamedFunction type: object description: Specifies a function to call by name. DeltaMessage: additionalProperties: false properties: role: default: null title: Role description: The role of the message author (typically `assistant`). type: string content: default: null title: Content description: The content chunk generated by the model. type: string tool_calls: items: $ref: '#/components/schemas/ToolCall' title: Tool Calls type: array description: Tool calls generated by the model. title: DeltaMessage type: object description: A delta message chunk in a streaming response. ChatCompletionLogProbs: additionalProperties: false properties: content: default: null title: Content description: A list of log probability information for each token in the content. items: $ref: '#/components/schemas/ChatCompletionLogProbsContent' type: array title: ChatCompletionLogProbs type: object description: Log probability information for the completion. CompletionTokensDetails: additionalProperties: false properties: accepted_prediction_tokens: default: 0 title: Accepted Prediction Tokens type: integer description: Number of tokens in accepted predictions (for speculative decoding). 
audio_tokens: default: 0 title: Audio Tokens type: integer description: Number of audio tokens generated. reasoning_tokens: default: 0 title: Reasoning Tokens type: integer description: >- Number of tokens used for reasoning (for models that support extended thinking). rejected_prediction_tokens: default: 0 title: Rejected Prediction Tokens type: integer description: Number of tokens in rejected predictions (for speculative decoding). title: CompletionTokensDetails type: object description: Breakdown of tokens used in the completion. PromptTokensDetails: additionalProperties: false properties: audio_tokens: default: 0 title: Audio Tokens type: integer description: Number of audio tokens in the prompt. cached_tokens: default: 0 title: Cached Tokens type: integer description: Number of tokens retrieved from cache. title: PromptTokensDetails type: object description: Breakdown of tokens used in the prompt. ImageURL: additionalProperties: false properties: url: title: Url type: string description: The URL of the image, or a base64-encoded data URL. detail: default: null title: Detail description: >- The detail level: `auto` (default), `low` (512px max), or `high` (full resolution). enum: - auto - low - high type: string required: - url title: ImageURL type: object description: An image URL with optional detail settings. InputAudio: additionalProperties: false properties: data: title: Data type: string description: Base64-encoded audio data. format: enum: - wav - mp3 title: Format type: string description: 'The audio format: `wav` or `mp3`.' required: - data - format title: InputAudio type: object description: Audio input data. Function: additionalProperties: false properties: arguments: anyOf: - type: string - additionalProperties: true type: object title: Arguments description: The function arguments as a JSON string or object. name: title: Name type: string description: The name of the function. required: - arguments - name title: Function type: object description: >- The arguments to call the function with, as generated by the model in JSON format. The model may not always generate valid JSON and may hallucinate parameters not defined by your function schema. Validate the arguments in your code before calling your function. ToolCall: additionalProperties: false properties: index: title: Index type: integer description: The index of this tool call in the list of tool calls. id: title: Id type: string description: A unique identifier for this tool call. type: const: function default: function title: Type type: string description: The type of tool call (always `function`). function: $ref: '#/components/schemas/FunctionCall' description: The function that the model called. required: - index - function title: ToolCall type: object description: A tool call generated by the model. ChatCompletionLogProbsContent: additionalProperties: false properties: token: title: Token type: string description: The token string. logprob: default: -9999 title: Logprob type: number description: The log probability of the token. bytes: default: null title: Bytes description: The UTF-8 byte representation of the token. items: type: integer type: array top_logprobs: default: null title: Top Logprobs description: >- List of the most likely tokens and their log probabilities at this position. items: $ref: '#/components/schemas/ChatCompletionLogProb' type: array required: - token title: ChatCompletionLogProbsContent type: object description: Log probability information for a token in the content. 
FunctionCall: additionalProperties: false properties: name: default: null title: Name description: The name of the function to call. type: string arguments: title: Arguments type: string description: The arguments to call the function with, as a JSON string. required: - arguments title: FunctionCall type: object description: >- The name and arguments of a function that should be called, as generated by the model. ChatCompletionLogProb: additionalProperties: false properties: token: title: Token type: string description: The token string. logprob: default: -9999 title: Logprob type: number description: The log probability of the token. bytes: default: null title: Bytes description: The UTF-8 byte representation of the token. items: type: integer type: array required: - token title: ChatCompletionLogProb type: object description: Log probability information for a token. securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- Use `Api-Key` as the scheme in the Authorization header: `Authorization: Api-Key YOUR_API_KEY`. ```` --- # Source: https://docs.baseten.co/training/concepts/checkpointing.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Checkpointing > Learn how to use Baseten's checkpointing feature to manage model checkpoints and avoid disk errors during training. With checkpointing enabled, you can manage your model checkpoints seamlessly and avoid common training issues. ## Benefits of Checkpointing * **Avoid catastrophic out of disk errors**: We mount additional storage at the checkpointing directory to help avoid out of disk errors during your training run. * **Maximize GPU utilization**: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training. * **Seamless checkpoint management**: Checkpoints are automatically uploaded to cloud storage for easy access and management. ## Enabling Checkpointing To enable checkpointing, add a `CheckpointingConfig` to the `Runtime` and set `enabled` to `True`: ```python theme={"system"} from truss_train import definitions training_runtime = definitions.Runtime( # ... other configuration options checkpointing_config=definitions.CheckpointingConfig(enabled=True) ) ``` ## Using the Checkpoint Directory Baseten will automatically export the [`$BT_CHECKPOINT_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable in your job's environment. **Write your checkpoints to the `$BT_CHECKPOINT_DIR` directory so Baseten can automatically backup and preserve them.** ## Serving Checkpoints Once your training is complete, you can serve your model checkpoints using Baseten's serving infrastructure. Learn more about [serving checkpoints](/training/deployment). When you delete a job or project, all undeployed checkpoints are permanently deleted with no archival or recovery option. Deployed checkpoints aren't affected. See [Management](/training/management) for details. --- # Source: https://docs.baseten.co/reference/cli/truss/cleanup.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # truss cleanup > Clean up Truss data. 
```sh theme={"system"} truss cleanup [OPTIONS] ``` Clears temporary directories created by Truss for operations like building Docker images. Use this to free up disk space. **Example:** To clean up temporary Truss data, use the following: ```sh theme={"system"} truss cleanup ``` --- # Source: https://docs.baseten.co/development/model/code-first-development.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Python driven configuration for models 🆕 > Use code-first development tools to streamline model production. This feature is still in beta. In addition to our normal YAML configuration, we support configuring your model using pure Python. This offers the following benefits: * **Typed configuration via Python code** with IDE autocomplete, instead of a separate `yaml` configuration file * **Simpler directory structure** that IDEs support for module resolution In this guide, we go through deploying a simple Model using this new framework. ### Step 1: Initializing your project We leverage traditional `truss init` functionality with a new flag to create the directory structure: ```bash theme={"system"} truss init my-new-model --python-config ``` ### Step 2: Write your model To build a model with this new framework, we require two things: * A class that inherits from `baseten.ModelBase`, which will serve as the entrypoint when invoking `/predict` * A `predict` method with type hints That’s it! The following is a contrived example of a complete model that will keep a running total of user provided input: ```python my_model.py theme={"system"} import truss_chains as baseten class RunningTotalCalculator(baseten.ModelBase): def __init__(self): self._running_total = 0 async def predict(self, increment: int) -> int: self._running_total += increment return self._running_total ``` ### Step 3: Deploy, patch, and publish your model In order to deploy a development version of your new model with live reload, you can run: ```bash theme={"system"} truss push my_model.py --watch ``` Please note that `push` (as well as all other commands below) will require that you pass the path to the file containing the model as the final argument. This new workflow also supports patching, so you can quickly iterate during development without building new images every time. ```bash theme={"system"} truss watch my_model.py ``` To deploy a production-ready version, use: ```bash theme={"system"} truss push my_model.py --publish ``` ### Model Configuration Models can configure requirements for compute hardware (CPU count, GPU type and count, etc) and software dependencies (Python libraries or system packages) via the [`remote_config`](/reference/sdk/chains#remote-configuration) class variable within the model: ```python my_model.py theme={"system"} class RunningTotalCalculator(baseten.ModelBase): remote_config: baseten.RemoteConfig = baseten.RemoteConfig( compute=baseten.Compute(cpu_count=4, memory="1Gi", gpu="T4", gpu_count=2) ) ... ``` See the [remote configuration reference](/reference/sdk/chains#remote-configuration) for a complete list of options. ### Context (access information) You can add [`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext) object as an optional final argument to the **`__init__`**-method of a Model. This allows you to use secrets within your Model, but note that they’ll also need to be added to the **`assets`**. 
We only expose secrets to the model that were explicitly requested in `assets` to comply with best security practices. ```python my_model.py theme={"system"} class RunningTotalCalculator(baseten.ModelBase): remote_config: baseten.RemoteConfig = baseten.RemoteConfig( ... assets=baseten.Assets(secret_keys=["token"]) ) def __init__(self, context: baseten.DeploymentContext = baseten.depends_context()): ... self._token = context.secrets["token"] ``` ### Packages If you want to include modules in your model, you can easily create them from the root of the project: ```bash theme={"system"} my-new-model/ module_1/ submodule/ script.py module_2/ another_script.py my_model.py ``` With this file structure, you would import in `my_model.py` as follows: ```python my_model.py theme={"system"} import truss_chains as baseten from module_1.submodule import script from module_2 import another_script class RunningTotalCalculator(baseten.ModelBase): .... ``` ### Known Limitations * RemoteConfig does *not* support all the options exposed by the traditional `config.yaml`. If you’re excited about this new development experience but need a specific feature ported over, please reach out to us! * This new framework does not support `preprocess` or `postprocess` hooks. We typically recommend inlining functionality from those functions if easy, or utilizing `chains` if the needs are more complex. --- # Source: https://docs.baseten.co/examples/comfyui.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deploy a ComfyUI project > Deploy your ComfyUI workflow as an API endpoint In this example, we'll deploy an **anime style transfer** ComfyUI workflow using truss. This example won't require any Python code, but there are a few pre-requisites in order to get started. Pre-Requisites: 1. Convert your ComfyUI workflow to an **API compatible JSON format**. The regular JSON format that is used to export Comfy workflows will not work here. 2. Have a list of the models your workflow requires along with URLs to where each model can be downloaded ## Setup Clone the truss-examples repository and navigate to the `comfyui-truss` directory ```bash theme={"system"} git clone https://github.com/basetenlabs/truss-examples.git cd truss-examples/comfyui-truss ``` This repository already contains all the files we need to deploy our ComfyUI workflow. There are just two files we need to modify: 1. `config.yaml` 2. 
`data/comfy_ui_workflow.json` ## Setting up the `config.yaml` ```yaml theme={"system"} build_commands: - git clone https://github.com/comfyanonymous/ComfyUI.git - cd ComfyUI && git checkout b1fd26fe9e55163f780bf9e5f56bf9bf5f035c93 && pip install -r requirements.txt - cd ComfyUI/custom_nodes && git clone https://github.com/LykosAI/ComfyUI-Inference-Core-Nodes --recursive && cd ComfyUI-Inference-Core-Nodes && pip install -e .[cuda12] - cd ComfyUI/custom_nodes && git clone https://github.com/ZHO-ZHO-ZHO/ComfyUI-Gemini --recursive && cd ComfyUI-Gemini && pip install -r requirements.txt - cd ComfyUI/custom_nodes && git clone https://github.com/kijai/ComfyUI-Marigold --recursive && cd ComfyUI-Marigold && pip install -r requirements.txt - cd ComfyUI/custom_nodes && git clone https://github.com/omar92/ComfyUI-QualityOfLifeSuit_Omar92 --recursive - cd ComfyUI/custom_nodes && git clone https://github.com/Fannovel16/comfyui_controlnet_aux --recursive && cd comfyui_controlnet_aux && pip install -r requirements.txt - cd ComfyUI/models/controlnet && wget -O control-lora-canny-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-canny-rank256.safetensors - cd ComfyUI/models/controlnet && wget -O control-lora-depth-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-depth-rank256.safetensors - cd ComfyUI/models/checkpoints && wget -O dreamshaperXL_v21TurboDPMSDE.safetensors https://civitai.com/api/download/models/351306 - cd ComfyUI/models/loras && wget -O StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors https://huggingface.co/artificialguybr/StudioGhibli.Redmond-V2/resolve/main/StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors environment_variables: {} external_package_dirs: [] model_metadata: {} model_name: Anime Style Transfer python_version: py310 requirements: - websocket-client - accelerate - opencv-python resources: accelerator: H100 use_gpu: true secrets: {} system_packages: - wget - ffmpeg - libgl1-mesa-glx ``` The main part that needs to get filled out is under `build_commands`. Build commands are shell commands that get run during the build stage of the docker image. In this example, the first two lines clone the ComfyUI repository and install the python requirements. The latter commands install various custom nodes and models and place them in their respective directory within the ComfyUI repository. ## Modifying `data/comfy_ui_workflow.json` The `comfy_ui_workflow.json` contains the entire ComfyUI workflow in an API compatible format. This is the workflow that will get executed by the ComfyUI server. Here is the workflow we will be using for this example. 
```json theme={"system"} { "1": { "inputs": { "ckpt_name": "dreamshaperXL_v21TurboDPMSDE.safetensors" }, "class_type": "CheckpointLoaderSimple", "_meta": { "title": "Load Checkpoint" } }, "3": { "inputs": { "image": "{{input_image}}", "upload": "image" }, "class_type": "LoadImage", "_meta": { "title": "Load Image" } }, "4": { "inputs": { "text": [ "160", 0 ], "clip": [ "154", 1 ] }, "class_type": "CLIPTextEncode", "_meta": { "title": "CLIP Text Encode (Prompt)" } }, "12": { "inputs": { "strength": 0.8, "conditioning": [ "131", 0 ], "control_net": [ "13", 0 ], "image": [ "71", 0 ] }, "class_type": "ControlNetApply", "_meta": { "title": "Apply ControlNet" } }, "13": { "inputs": { "control_net_name": "control-lora-canny-rank256.safetensors" }, "class_type": "ControlNetLoader", "_meta": { "title": "Load ControlNet Model" } }, "15": { "inputs": { "strength": 0.8, "conditioning": [ "12", 0 ], "control_net": [ "16", 0 ], "image": [ "18", 0 ] }, "class_type": "ControlNetApply", "_meta": { "title": "Apply ControlNet" } }, "16": { "inputs": { "control_net_name": "control-lora-depth-rank256.safetensors" }, "class_type": "ControlNetLoader", "_meta": { "title": "Load ControlNet Model" } }, "18": { "inputs": { "seed": 995352869972963, "denoise_steps": 4, "n_repeat": 10, "regularizer_strength": 0.02, "reduction_method": "median", "max_iter": 5, "tol": 0.001, "invert": true, "keep_model_loaded": true, "n_repeat_batch_size": 2, "use_fp16": true, "scheduler": "LCMScheduler", "normalize": true, "model": "marigold-lcm-v1-0", "image": [ "3", 0 ] }, "class_type": "MarigoldDepthEstimation", "_meta": { "title": "MarigoldDepthEstimation" } }, "19": { "inputs": { "images": [ "71", 0 ] }, "class_type": "PreviewImage", "_meta": { "title": "Preview Image" } }, "20": { "inputs": { "images": [ "18", 0 ] }, "class_type": "PreviewImage", "_meta": { "title": "Preview Image" } }, "21": { "inputs": { "seed": 358881677137626, "steps": 20, "cfg": 7, "sampler_name": "dpmpp_2m_sde", "scheduler": "karras", "denoise": 0.7000000000000001, "model": [ "154", 0 ], "positive": [ "15", 0 ], "negative": [ "4", 0 ], "latent_image": [ "25", 0 ] }, "class_type": "KSampler", "_meta": { "title": "KSampler" } }, "25": { "inputs": { "pixels": [ "70", 0 ], "vae": [ "1", 2 ] }, "class_type": "VAEEncode", "_meta": { "title": "VAE Encode" } }, "27": { "inputs": { "samples": [ "21", 0 ], "vae": [ "1", 2 ] }, "class_type": "VAEDecode", "_meta": { "title": "VAE Decode" } }, "70": { "inputs": { "upscale_method": "lanczos", "megapixels": 1, "image": [ "3", 0 ] }, "class_type": "ImageScaleToTotalPixels", "_meta": { "title": "ImageScaleToTotalPixels" } }, "71": { "inputs": { "low_threshold": 50, "high_threshold": 150, "resolution": 1024, "image": [ "3", 0 ] }, "class_type": "CannyEdgePreprocessor", "_meta": { "title": "Canny Edge" } }, "123": { "inputs": { "images": [ "27", 0 ] }, "class_type": "PreviewImage", "_meta": { "title": "Preview Image" } }, "131": { "inputs": { "text": [ "159", 0 ], "clip": [ "154", 1 ] }, "class_type": "CLIPTextEncode", "_meta": { "title": "CLIP Text Encode (Prompt)" } }, "152": { "inputs": { "text": "{{prompt}}" }, "class_type": "Text _O", "_meta": { "title": "Text_1" } }, "154": { "inputs": { "lora_name": "StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors", "strength_model": 0.6, "strength_clip": 1, "model": [ "1", 0 ], "clip": [ "1", 1 ] }, "class_type": "LoraLoader", "_meta": { "title": "Load LoRA" } }, "156": { "inputs": { "text_1": [ "152", 0 ], "text_2": [ "158", 0 ] }, "class_type": "ConcatText_Zho", "_meta": { 
"title": "✨ConcatText_Zho" } }, "157": { "inputs": { "text": "StdGBRedmAF,Studio Ghibli," }, "class_type": "Text _O", "_meta": { "title": "Text _2" } }, "158": { "inputs": { "text": "looking at viewer, anime artwork, anime style, key visual, vibrant, studio anime, highly detailed" }, "class_type": "Text _O", "_meta": { "title": "Text _O" } }, "159": { "inputs": { "text_1": [ "156", 0 ], "text_2": [ "157", 0 ] }, "class_type": "ConcatText_Zho", "_meta": { "title": "✨ConcatText_Zho" } }, "160": { "inputs": { "text": "photo, deformed, black and white, realism, disfigured, low contrast" }, "class_type": "Text _O", "_meta": { "title": "Text _O" } } } ``` **Important:** If you look at the JSON file above, you'll notice we have templatized a few items using the **`{{handlebars}}`** templating style. If there are any inputs in your ComfyUI workflow that should be variables such as input prompts, images, etc, you should templatize them using the handlebars format. In this example workflow, there are two inputs: **`{{input_image}}`** and **`{{prompt}}`** When making an API call to this workflow, we will be able to pass in any variable for these two inputs. ## Deploying the Workflow to Baseten Once you have both your `config.yaml` and `data/comfy_ui_workflow.json` filled out we can deploy this workflow just like any other model on Baseten. 1. `pip install truss --upgrade` 2. `truss push --publish` ## Running Inference When you deploy the truss, it will spin up a new deployment in your Baseten account. Each deployment will expose a REST API endpoint which we can use to call this workflow. ```python theme={"system"} import requests import os import base64 from PIL import Image from io import BytesIO # Replace the empty string with your model id below model_id = "" baseten_api_key = os.environ["BASETEN_API_KEY"] BASE64_PREAMBLE = "data:image/png;base64," def pil_to_b64(pil_img):    buffered = BytesIO()    pil_img.save(buffered, format="PNG")    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")    return img_str def b64_to_pil(b64_str):    return Image.open(BytesIO(base64.b64decode(b64_str.replace(BASE64_PREAMBLE, "")))) values = {  "prompt": "american Shorthair",  "input_image": {"type": "image", "data": pil_to_b64(Image.open("/path/to/cat.png"))} } resp = requests.post(    f"https://model-{model_id}.api.baseten.co/production/predict",    headers={"Authorization": f"Api-Key {baseten_api_key}"},    json={"workflow_values": values} ) res = resp.json() results = res.get("result") for item in results:    if item.get("format") == "png":        data = item.get("data")        img = b64_to_pil(data)        img.save(f"pet-style-transfer-1.png") ``` If you recall, we templatized two variables in our workflow: `prompt` and `input_image`. In our API call we can specify the values for these two variables like so: ```json theme={"system"} values = {  "prompt": "Maltipoo",  "input_image": {"type": "image", "data": pil_to_b64(Image.open("/path/to/dog.png"))} } ``` If your workflow contains more variables, simply add them to the dictionary above. The API call returns an image in the form of a base64 string, which we convert to a PNG image. 
--- # Source: https://docs.baseten.co/inference/concepts.md # Source: https://docs.baseten.co/development/concepts.md # Source: https://docs.baseten.co/development/chain/concepts.md # Source: https://docs.baseten.co/deployment/concepts.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Concepts Baseten provides a flexible and scalable infrastructure for deploying and managing machine learning models. This page introduces key concepts - [deployments](/deployment/deployments), [environments](/deployment/environments), [resources](/deployment/resources), and [autoscaling](/deployment/autoscaling) — that shape how models are served, tested, and optimized for performance and cost efficiency. ## Deployments [Deployments](/deployment/deployments) define how models are served, scaled, and updated. They optimize resource use with autoscaling, scaling to zero, and controlled traffic shifts while ensuring minimal downtime. Deployments can be deactivated to pause resource usage or deleted permanently when no longer needed. ## Environments [Environments](/deployment/environments) group deployments, providing stable endpoints and autoscaling to manage model release cycles. They enable structured testing, controlled rollouts, and seamless transitions between staging and production. Each environment maintains its own settings and metrics, ensuring reliable and scalable deployments. ## Resources [Resources](/deployment/resources) define the hardware allocated to a model server, balancing performance and cost. Choosing the right instance type ensures efficient inference without unnecessary overhead. Resources can be set before deployment in Truss or adjusted later in the model dashboard to match workload demands. ## Autoscaling [Autoscaling](/deployment/autoscaling) dynamically adjusts model resources to handle traffic fluctuations efficiently while minimizing costs. Deployments scale between a defined range of replicas based on demand, with settings for concurrency, scaling speed, and scale-to-zero for low-traffic models. Optimizations like network acceleration and cold start pods ensure fast response times even when scaling up from zero. --- # Source: https://docs.baseten.co/development/model/concurrency.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Request concurrency > A guide to setting concurrency for your model Configuring concurrency optimizes **model performance**, balancing **throughput** and **latency**. In Baseten and Truss, concurrency is managed at **two levels**: 1. **Concurrency Target** – Limits the number of requests **sent** to a single replica. 2. **Predict Concurrency** – Limits how many requests the predict function handles **inside the model container**. ## 1. Concurrency Target * **Set in the Baseten UI** – Defines how many requests a single replica can process at once. * **Triggers autoscaling** – If all replicas hit the concurrency target, additional replicas spin up. **Example:** * **Concurrency Target = 2, Single Replica** * **5 requests arrive** → 2 are processed immediately, **3 are queued**. * If max replicas aren't reached, **autoscaling spins up a new replica**. ## 2. Predict Concurrency * **Set in** `config.yaml` – Controls how many requests can be **processed by** predict simultaneously. 
* **Protects GPU resources** – Prevents multiple requests from overloading the GPU. ### Configuring Predict Concurrency ```yaml config.yaml theme={"system"} model_name: "My model with concurrency limits" runtime: predict_concurrency: 2 # Default is 1 ``` ### How It Works Inside a Model Pod 1. **Requests arrive** → All begin preprocessing (e.g., downloading images from S3). 2. **Predict runs on GPU** → Limited by `predict_concurrency`. 3. **Postprocessing begins** → Can run while other requests are still in inference. ## When to Use Predict Concurrency * ✅ **Protect GPU resources** – Prevent multiple requests from degrading performance. * ✅ **Allow parallel preprocessing/postprocessing** – I/O tasks can continue even when inference is blocked. Ensure `Concurrency Target` is set high enough to send enough requests to the container. --- # Source: https://docs.baseten.co/development/model/configuration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Configuration > How to configure your model. ML models depend on external libraries, data files, and specific hardware configurations. This guide shows you how to configure your model's dependencies and resources. The `config.yaml` file defines your model's configuration. Common options include: # Environment variables To set environment variables in the model serving environment, use the `environment_variables` key: ```yaml config.yaml theme={"system"} environment_variables: MY_ENV_VAR: my_value ``` # Python packages Python packages can be specified in two ways in the `config.yaml` file: 1. `requirements`: A list of Python packages to install. 2. `requirements_file`: A requirements.txt file to install pip packages from. To specify Python packages as a list, use the following: ```yaml config.yaml theme={"system"} requirements: - package_name - package_name2 ``` Pin package versions using the `==` operator: ```yaml config.yaml theme={"system"} requirements: - package_name==1.0.0 - package_name2==2.0.0 ``` If you need more control over the installation process and want to use different pip options or repositories, you can specify a `requirements_file` instead. ```yaml config.yaml theme={"system"} requirements_file: ./requirements.txt ``` # System packages Truss also has support for installing apt-installable Debian packages. To add system packages to your model serving environment, add the following to your `config.yaml` file: ```yaml config.yaml theme={"system"} system_packages: - package_name - package_name2 ``` For example, to install Tesseract OCR: ```yaml config.yaml theme={"system"} system_packages: - tesseract-ocr ``` # Resources Specify hardware resources in the `resources` section. **Option 1: Specify individual resource fields** For a CPU model: ```yaml config.yaml theme={"system"} resources: cpu: "1" memory: 2Gi ``` For a GPU model: ```yaml config.yaml theme={"system"} resources: accelerator: "L4" ``` When you push your model, it will be assigned an instance type matching the specifications required. **Option 2: Specify an exact instance type** ```yaml config.yaml theme={"system"} resources: instance_type: "L4:4x16" ``` Using `instance_type` lets you select an exact SKU. When specified, other resource fields are ignored. See the [Resources](/deployment/resources) page for more information on options available. # Advanced configuration There are numerous other options for configuring your model. 
See some of the other guides: * [Secrets](/development/model/secrets) * [Data](/development/model/data-directory) * [Custom Build Commands](/development/model/build-commands) * [Base Docker Images](/development/model/base-images) * [Custom Servers](/development/model/custom-server) * [Custom Health Checks](/development/model/custom-health-checks) --- # Source: https://docs.baseten.co/reference/cli/truss/configure.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # truss configure > Configure Truss settings. ```sh theme={"system"} truss configure [OPTIONS] ``` Configures Truss settings interactively. Use this command to set up or modify your local Truss configuration. **Example:** To configure Truss settings interactively, use the following: ```sh theme={"system"} truss configure ``` You should see a configuration file that you can edit, for example: ```yaml ~/.trussrc theme={"system"} [baseten] remote_provider = baseten api_key = YOUR_API_KEY remote_url = https://app.baseten.co ``` --- # Source: https://docs.baseten.co/reference/cli/truss/container.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # truss container > Run and manage Truss containers locally. ```sh theme={"system"} truss container [OPTIONS] COMMAND [ARGS]... ``` Manage Docker containers for your Truss. *** ## `kill` Kill containers related to a specific Truss. ```sh theme={"system"} truss container kill [OPTIONS] [TARGET_DIRECTORY] ``` ### Arguments A Truss directory. Defaults to current directory. **Example:** To kill containers for the current Truss, use the following: ```sh theme={"system"} truss container kill ``` *** ## `kill-all` Kill all Truss containers that are not manually persisted. ```sh theme={"system"} truss container kill-all [OPTIONS] ``` **Example:** To kill all Truss containers, use the following: ```sh theme={"system"} truss container kill-all ``` *** ## `logs` Get logs from a running Truss container. ```sh theme={"system"} truss container logs [OPTIONS] [TARGET_DIRECTORY] ``` ### Arguments A Truss directory. Defaults to current directory. **Example:** To view logs from the current Truss container, use the following: ```sh theme={"system"} truss container logs ``` --- # Source: https://docs.baseten.co/reference/management-api/environments/create-a-chain-environment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create Chain environment > Create a chain environment. Returns the resulting environment. ## OpenAPI ````yaml post /v1/chains/{chain_id}/environments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/environments: parameters: - $ref: '#/components/parameters/chain_id' post: summary: Create a chain environment description: Create a chain environment. Returns the resulting environment. requestBody: content: application/json: schema: $ref: '#/components/schemas/CreateChainEnvironmentRequestV1' required: true responses: '200': description: Environment for oracles. 
content: application/json: schema: $ref: '#/components/schemas/ChainEnvironmentV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true schemas: CreateChainEnvironmentRequestV1: description: A request to create a custom environment for a chain. properties: name: description: Name of the environment examples: - staging title: Name type: string promotion_settings: anyOf: - $ref: '#/components/schemas/UpdatePromotionSettingsV1' - type: 'null' default: null description: Promotion settings for the environment examples: - ramp_up_duration_seconds: 600 ramp_up_while_promoting: true redeploy_on_promotion: true rolling_deploy: null rolling_deploy_config: null chainlet_settings: anyOf: - items: $ref: '#/components/schemas/ChainletEnvironmentSettingsRequestV1' type: array - type: 'null' default: null description: >- Mapping of chainlet name to the desired chainlet environment settings examples: - - autoscaling_settings: autoscaling_window: 800 concurrency_target: 4 max_replica: 3 min_replica: 2 scale_down_delay: 63 target_in_flight_tokens: null target_utilization_percentage: null chainlet_name: HelloWorld instance_type_id: 2x8 - autoscaling_settings: autoscaling_window: null concurrency_target: null max_replica: 3 min_replica: 3 scale_down_delay: null target_in_flight_tokens: null target_utilization_percentage: null chainlet_name: RandInt instance_type_id: A10Gx8x32 title: Chainlet Settings required: - name title: CreateChainEnvironmentRequestV1 type: object ChainEnvironmentV1: description: Environment for oracles. properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment chainlet_settings: description: Environment settings for the chainlets items: $ref: '#/components/schemas/ChainletEnvironmentSettingsV1' title: Chainlet Settings type: array current_deployment: anyOf: - $ref: '#/components/schemas/ChainDeploymentV1' - type: 'null' description: Current chain deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/ChainDeploymentV1' - type: 'null' default: null description: >- Candidate chain deployment being promoted to the environment, if a promotion is in progress required: - name - created_at - chain_id - promotion_settings - chainlet_settings - current_deployment title: ChainEnvironmentV1 type: object UpdatePromotionSettingsV1: description: Promotion settings for model promotion properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: null description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: null description: Whether the environment should rely on rolling deploy orchestration. 
examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/UpdateRollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: null description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: null description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: UpdatePromotionSettingsV1 type: object ChainletEnvironmentSettingsRequestV1: description: Request to create environment settings for a chainlet. properties: chainlet_name: description: Name of the chainlet examples: - HelloWorld title: Chainlet Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/UpdateAutoscalingSettingsV1' - type: 'null' default: null description: Autoscaling settings for the chainlet examples: - autoscaling_window: 60 concurrency_target: 1 max_replica: 1 min_replica: 0 scale_down_delay: 900 target_in_flight_tokens: null target_utilization_percentage: 70 instance_type_id: default: 1x2 description: ID of the instance type to use for the chainlet examples: - 1x4 - 2x8 - A10G:2x24x96 - H100:2x52x468 title: Instance Type Id type: string required: - chainlet_name title: ChainletEnvironmentSettingsRequestV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object ChainletEnvironmentSettingsV1: description: Environment settings for a chainlet. properties: chainlet_name: description: Name of the chainlet title: Chainlet Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the chainlet required: - chainlet_name - autoscaling_settings - instance_type title: ChainletEnvironmentSettingsV1 type: object ChainDeploymentV1: description: A deployment of a chain. 
properties: id: description: Unique identifier of the chain deployment title: Id type: string created_at: description: Time the chain deployment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string environment: anyOf: - type: string - type: 'null' description: Environment the chain deployment is deployed in title: Environment chainlets: description: Chainlets in the chain deployment items: $ref: '#/components/schemas/ChainletV1' title: Chainlets type: array status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chain deployment required: - id - created_at - chain_id - environment - chainlets - status title: ChainDeploymentV1 type: object UpdateRollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: anyOf: - $ref: '#/components/schemas/RollingDeployStrategyV1' - type: 'null' default: null description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: anyOf: - type: integer - type: 'null' default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent max_unavailable_percent: anyOf: - type: integer - type: 'null' default: null description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent stabilization_time_seconds: anyOf: - type: integer - type: 'null' default: null description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds promotion_cleanup_strategy: anyOf: - $ref: '#/components/schemas/PromotionCleanupStrategyV1' - type: 'null' default: null description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: UpdateRollingDeployConfigV1 type: object UpdateAutoscalingSettingsV1: additionalProperties: false description: >- A request to update autoscaling settings for a deployment. All fields are optional, and we only update ones passed in. properties: min_replica: anyOf: - type: integer - type: 'null' default: null description: Minimum number of replicas examples: - 0 title: Min Replica max_replica: anyOf: - type: integer - type: 'null' default: null description: Maximum number of replicas examples: - 7 title: Max Replica autoscaling_window: anyOf: - type: integer - type: 'null' default: null description: Timeframe of traffic considered for autoscaling decisions examples: - 600 title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' default: null description: Waiting period before scaling down any active replica examples: - 120 title: Scale Down Delay concurrency_target: anyOf: - type: integer - type: 'null' default: null description: Number of requests per replica before scaling up examples: - 2 title: Concurrency Target target_utilization_percentage: anyOf: - type: integer - type: 'null' default: null description: Target utilization percentage for scaling up/down. examples: - 70 title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. 
examples: - 40000 title: Target In Flight Tokens title: UpdateAutoscalingSettingsV1 type: object RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object ChainletV1: description: A chainlet in a chain deployment. 
properties: id: description: Unique identifier of the chainlet title: Id type: string name: description: Name of the chainlet title: Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type_name: description: Name of the instance type the chainlet is deployed on title: Instance Type Name type: string active_replica_count: description: Number of active replicas title: Active Replica Count type: integer status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chainlet required: - id - name - autoscaling_settings - instance_type_name - active_replica_count - status title: ChainletV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/environments/create-an-environment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create environment > Creates an environment for the specified model and returns the environment. ## OpenAPI ````yaml post /v1/models/{model_id}/environments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments: parameters: - $ref: '#/components/parameters/model_id' post: summary: Create an environment description: >- Creates an environment for the specified model and returns the environment. requestBody: content: application/json: schema: $ref: '#/components/schemas/CreateEnvironmentRequestV1' required: true responses: '200': description: Environment for oracles. content: application/json: schema: $ref: '#/components/schemas/EnvironmentV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: CreateEnvironmentRequestV1: description: A request to create an environment. 
properties: name: description: Name of the environment examples: - staging title: Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/UpdateAutoscalingSettingsV1' - type: 'null' default: null description: Autoscaling settings for the environment examples: - autoscaling_window: 800 concurrency_target: 3 max_replica: 2 min_replica: 1 scale_down_delay: 60 target_in_flight_tokens: null target_utilization_percentage: null promotion_settings: anyOf: - $ref: '#/components/schemas/UpdatePromotionSettingsV1' - type: 'null' default: null description: Promotion settings for the environment examples: - ramp_up_duration_seconds: 600 ramp_up_while_promoting: true redeploy_on_promotion: true rolling_deploy: true rolling_deploy_config: null required: - name title: CreateEnvironmentRequestV1 type: object EnvironmentV1: description: Environment for oracles. properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string model_id: description: Unique identifier of the model title: Model Id type: string current_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' description: Current deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' default: null description: >- Candidate deployment being promoted to the environment, if a promotion is in progress autoscaling_settings: $ref: '#/components/schemas/AutoscalingSettingsV1' description: Autoscaling settings for the environment promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the environment required: - name - created_at - model_id - current_deployment - autoscaling_settings - promotion_settings - instance_type title: EnvironmentV1 type: object UpdateAutoscalingSettingsV1: additionalProperties: false description: >- A request to update autoscaling settings for a deployment. All fields are optional, and we only update ones passed in. properties: min_replica: anyOf: - type: integer - type: 'null' default: null description: Minimum number of replicas examples: - 0 title: Min Replica max_replica: anyOf: - type: integer - type: 'null' default: null description: Maximum number of replicas examples: - 7 title: Max Replica autoscaling_window: anyOf: - type: integer - type: 'null' default: null description: Timeframe of traffic considered for autoscaling decisions examples: - 600 title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' default: null description: Waiting period before scaling down any active replica examples: - 120 title: Scale Down Delay concurrency_target: anyOf: - type: integer - type: 'null' default: null description: Number of requests per replica before scaling up examples: - 2 title: Concurrency Target target_utilization_percentage: anyOf: - type: integer - type: 'null' default: null description: Target utilization percentage for scaling up/down. examples: - 70 title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. 
examples: - 40000 title: Target In Flight Tokens title: UpdateAutoscalingSettingsV1 type: object UpdatePromotionSettingsV1: description: Promotion settings for model promotion properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: null description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: null description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/UpdateRollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: null description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: null description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: UpdatePromotionSettingsV1 type: object DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. 
properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object UpdateRollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: anyOf: - $ref: '#/components/schemas/RollingDeployStrategyV1' - type: 'null' default: null description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: anyOf: - type: integer - type: 'null' default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent max_unavailable_percent: anyOf: - type: integer - type: 'null' default: null description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent stabilization_time_seconds: anyOf: - type: integer - type: 'null' default: null description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds promotion_cleanup_strategy: anyOf: - $ref: '#/components/schemas/PromotionCleanupStrategyV1' - type: 'null' default: null description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: UpdateRollingDeployConfigV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. 
enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/create-training-project.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create training project > Upserts a training project with the specified metadata. ## OpenAPI ````yaml post /v1/training_projects openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects: post: summary: Upsert a training project. description: Upserts a training project with the specified metadata. requestBody: content: application/json: schema: $ref: '#/components/schemas/UpsertTrainingProjectRequestV1' required: true responses: '200': description: A response to upserting a training project. content: application/json: schema: $ref: '#/components/schemas/UpsertTrainingProjectResponseV1' components: schemas: UpsertTrainingProjectRequestV1: description: A request to upsert a training project. properties: training_project: $ref: '#/components/schemas/UpsertTrainingProjectV1' description: The training project to upsert. required: - training_project title: UpsertTrainingProjectRequestV1 type: object UpsertTrainingProjectResponseV1: description: A response to upserting a training project. properties: training_project: $ref: '#/components/schemas/TrainingProjectV1' description: The upserted training project. required: - training_project title: UpsertTrainingProjectResponseV1 type: object UpsertTrainingProjectV1: description: Fields that can be upserted on a training project. properties: name: description: Name of the training project. examples: - My Training Project title: Name type: string required: - name title: UpsertTrainingProjectV1 type: object TrainingProjectV1: properties: id: description: Unique identifier of the training project title: Id type: string name: description: Name of the training project. title: Name type: string created_at: description: Time the training project was created in ISO 8601 format. format: date-time title: Created At type: string updated_at: description: Time the training project was updated in ISO 8601 format. format: date-time title: Updated At type: string team_name: anyOf: - type: string - type: 'null' default: null description: Name of the team associated with the training project. title: Team Name latest_job: anyOf: - $ref: '#/components/schemas/TrainingJobV1' - type: 'null' description: Most recently created training job for the training project. required: - id - name - created_at - updated_at - latest_job title: TrainingProjectV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. 
title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/teams/creates-a-team-api-key.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create a team API key > Creates a team API key with the provided name and type. The API key is returned in the response. ## OpenAPI ````yaml post /v1/teams/{team_id}/api_keys openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/teams/{team_id}/api_keys: parameters: - $ref: '#/components/parameters/team_id' post: summary: Creates a team API key description: >- Creates a team API key with the provided name and type. The API key is returned in the response. requestBody: content: application/json: schema: $ref: '#/components/schemas/CreateAPIKeyRequestV1' required: true responses: '200': description: Represents an API key. content: application/json: schema: $ref: '#/components/schemas/APIKeyV1' components: parameters: team_id: schema: type: string name: team_id in: path required: true schemas: CreateAPIKeyRequestV1: description: Request to create an API key. 
properties: name: anyOf: - type: string - type: 'null' default: null description: Optional name for the API key examples: - my-api-key title: Name type: $ref: '#/components/schemas/APIKeyCategory' description: Type of the API key. examples: - PERSONAL - WORKSPACE_EXPORT_METRICS - WORKSPACE_INVOKE - WORKSPACE_MANAGE_ALL model_ids: anyOf: - items: type: string type: array - type: 'null' default: null description: >- List of model IDs to scope the API key to, only present if type is 'WORKSPACE_EXPORT_METRICS' or 'WORKSPACE_INVOKE' examples: - - aaaaaaaa title: Model Ids required: - type title: CreateAPIKeyRequestV1 type: object APIKeyV1: description: Represents an API key. properties: api_key: description: The API key string title: Api Key type: string required: - api_key title: APIKeyV1 type: object APIKeyCategory: description: Enum representing the category of an API key. enum: - PERSONAL - WORKSPACE_MANAGE_ALL - WORKSPACE_EXPORT_METRICS - WORKSPACE_INVOKE title: APIKeyCategory type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/teams/creates-a-team-training-project.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create a team training project > Upserts a training project with the specified metadata for a team. ## OpenAPI ````yaml post /v1/teams/{team_id}/training_projects openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/teams/{team_id}/training_projects: parameters: - $ref: '#/components/parameters/team_id' post: summary: Upsert a training project in a specific team. description: Upserts a training project with the specified metadata for a team. requestBody: content: application/json: schema: $ref: '#/components/schemas/UpsertTrainingProjectRequestV1' required: true responses: '200': description: A response to upserting a training project. content: application/json: schema: $ref: '#/components/schemas/UpsertTrainingProjectResponseV1' components: parameters: team_id: schema: type: string name: team_id in: path required: true schemas: UpsertTrainingProjectRequestV1: description: A request to upsert a training project. properties: training_project: $ref: '#/components/schemas/UpsertTrainingProjectV1' description: The training project to upsert. required: - training_project title: UpsertTrainingProjectRequestV1 type: object UpsertTrainingProjectResponseV1: description: A response to upserting a training project. properties: training_project: $ref: '#/components/schemas/TrainingProjectV1' description: The upserted training project. required: - training_project title: UpsertTrainingProjectResponseV1 type: object UpsertTrainingProjectV1: description: Fields that can be upserted on a training project. properties: name: description: Name of the training project. examples: - My Training Project title: Name type: string required: - name title: UpsertTrainingProjectV1 type: object TrainingProjectV1: properties: id: description: Unique identifier of the training project title: Id type: string name: description: Name of the training project. 
title: Name type: string created_at: description: Time the training project was created in ISO 8601 format. format: date-time title: Created At type: string updated_at: description: Time the training project was updated in ISO 8601 format. format: date-time title: Updated At type: string team_name: anyOf: - type: string - type: 'null' default: null description: Name of the team associated with the training project. title: Team Name latest_job: anyOf: - $ref: '#/components/schemas/TrainingJobV1' - type: 'null' description: Most recently created training job for the training project. required: - id - name - created_at - updated_at - latest_job title: TrainingProjectV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/api-keys/creates-an-api-key.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Create an API key > Creates an API key with the provided name and type. The API key is returned in the response. ## OpenAPI ````yaml post /v1/api_keys openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/api_keys: post: summary: Creates an API key description: >- Creates an API key with the provided name and type. The API key is returned in the response. requestBody: content: application/json: schema: $ref: '#/components/schemas/CreateAPIKeyRequestV1' required: true responses: '200': description: Represents an API key. content: application/json: schema: $ref: '#/components/schemas/APIKeyV1' components: schemas: CreateAPIKeyRequestV1: description: Request to create an API key. properties: name: anyOf: - type: string - type: 'null' default: null description: Optional name for the API key examples: - my-api-key title: Name type: $ref: '#/components/schemas/APIKeyCategory' description: Type of the API key. examples: - PERSONAL - WORKSPACE_EXPORT_METRICS - WORKSPACE_INVOKE - WORKSPACE_MANAGE_ALL model_ids: anyOf: - items: type: string type: array - type: 'null' default: null description: >- List of model IDs to scope the API key to, only present if type is 'WORKSPACE_EXPORT_METRICS' or 'WORKSPACE_INVOKE' examples: - - aaaaaaaa title: Model Ids required: - type title: CreateAPIKeyRequestV1 type: object APIKeyV1: description: Represents an API key. properties: api_key: description: The API key string title: Api Key type: string required: - api_key title: APIKeyV1 type: object APIKeyCategory: description: Enum representing the category of an API key. enum: - PERSONAL - WORKSPACE_MANAGE_ALL - WORKSPACE_EXPORT_METRICS - WORKSPACE_INVOKE title: APIKeyCategory type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/engines/engine-builder-llm/custom-engine-builder.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Custom engine builder > Implement custom model.py for business logic, logging, and advanced inference patterns Implement custom business logic, request handling, and inference patterns in `model.py` while maintaining TensorRT-LLM performance. Custom engine builder enables billing integration, request tracing, fan-out generation, and multi-response workflows. ## Overview The custom engine builder lets you: * **Implement business logic**: Billing, usage tracking, access control. * **Add custom logging**: Request tracing, performance monitoring, audit trails. * **Create advanced inference patterns**: Fan-out generation, custom chat templates. * **Integrate external services**: APIs, databases, monitoring systems. * **Optimize performance**: Concurrent processing, custom batching strategies. 
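At a high level, a custom `model.py` receives the TensorRT-LLM engine handle and wraps it with your own logic. Below is a minimal, hedged sketch of that shape; the `X-User-ID` header and print-based usage tracking are illustrative placeholders, while the `trt_llm["engine"]` handle and `chat_completions` call mirror the full fan-out example later on this page:

```python model/model.py theme={"system"}
from typing import Any, Dict

from fastapi import Request


class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        # Engine handle injected by the engine builder (same pattern as the fan-out example below).
        self._engine = trt_llm["engine"]

    async def predict(self, model_input: Dict[str, Any], request: Request) -> Any:
        # Custom pre-call business logic: logging, usage tracking, access control, etc.
        user_id = request.headers.get("X-User-ID", "anonymous")  # illustrative header name
        print(f"[usage] chat completion requested by {user_id}")

        # Delegate generation to the underlying TensorRT-LLM engine.
        response = await self._engine.chat_completions(request=request, model_input=model_input)

        # Custom post-call logic (billing, response auditing) could run here before returning.
        return response
```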
## When to use custom engine builder ### Ideal use cases **Business logic integration:** * **Usage tracking**: Monitor token usage per customer/request. * **Access control**: Implement custom authentication/authorization. * **Rate limiting**: Custom rate limiting based on user tiers. * **Audit logging**: Compliance and security requirements. **Advanced inference patterns:** * **Fan-out generation**: Generate multiple responses from one request. * **Custom chat templates**: Domain-specific conversation formats. * **Multi-response workflows**: Parallel processing of variations. * **Conditional generation**: Business rule-based output modification. **Performance and monitoring:** * **Custom logging**: Request tracing, performance metrics. * **Concurrent processing**: Parallel generation for improved throughput. * **Usage analytics**: Track patterns and optimize accordingly. * **Error handling**: Custom error responses and fallback logic. ## Implementation ### Fan-out generation example Multi-generation fan-out generates multiple texts from a single request. Running them sequentially ensures the KV cache is created before subsequent generations. ```python model/model.py theme={"system"} # model/model.py import copy import asyncio from typing import Any, Dict, List, Optional, Tuple from fastapi import HTTPException, Request from starlette.responses import JSONResponse, StreamingResponse Message = Dict[str, str] # {"role": "...", "content": "..."} class Model: def __init__(self, trt_llm, **kwargs) -> None: self._secrets = kwargs["secrets"] self._engine = trt_llm["engine"] async def predict(self, model_input: Dict[str, Any], request: Request) -> Any: # Validate request structure if not isinstance(model_input, dict): raise HTTPException(status_code=400, detail="Request body must be a JSON object.") # Enforce non-streaming for this example if bool(model_input.get("stream", False)): raise HTTPException(status_code=400, detail="stream=true is not supported here; set stream=false.") # Extract base messages and fan-out tasks prompt_key, base_messages = self._get_base_messages(model_input) n, suffix_tasks = self._parse_fanout(model_input) # Build reusable request (don't forward fan-out params to engine) base_req = copy.deepcopy(model_input) base_req.pop("suffix_messages", None) # Extract debug ID for logging/tracing debug_id = request.headers.get("X-Debug-ID", "") # Run sequential generations per_gen_payloads: List[Any] = [] async def run_generation(i: int) -> Any: msgs_i = copy.deepcopy(base_messages) if suffix_tasks is not None: msgs_i.extend(suffix_tasks[i]) base_req[prompt_key] = msgs_i # Debug logging if debug_id: print(f"Running generation {debug_id} {i} with messages: {msgs_i}") # Time the generation start_time = asyncio.get_event_loop().time() resp = await self._engine.chat_completions(request=request, model_input=base_req) end_time = asyncio.get_event_loop().time() # Debug logging if debug_id: duration = end_time - start_time print(f"Result Generation {debug_id} {i} response: {resp} (took {duration:.3f}s)") # Validate response type if isinstance(resp, StreamingResponse) or hasattr(resp, "body_iterator"): raise HTTPException(status_code=400, detail="Engine returned streaming but stream=false was requested.") return resp # Run first generation payload = await run_generation(0) per_gen_payloads.append(payload) # Run remaining generations concurrently if n > 1: results = await asyncio.gather(*(run_generation(i) for i in range(1, n))) per_gen_payloads.extend(results) # Convert to OpenAI-ish 
multi-choice response out = self._to_openai_choices(per_gen_payloads) return JSONResponse(content=out.model_dump()) # Helper methods def _get_base_messages(self, model_input: Dict[str, Any]) -> Tuple[str, List[Message]]: """Extract and validate base messages from request.""" if "prompt" in model_input: raise HTTPException(status_code=400, detail='Use "messages" instead of "prompt" for chat models.') if "messages" not in model_input: raise HTTPException(status_code=400, detail='Request must include "messages" field.') key = "messages" msgs = model_input.get(key) if not isinstance(msgs, list): raise HTTPException(status_code=400, detail=f'"{key}" must be a list of messages.') for m in msgs: if not isinstance(m, dict) or "role" not in m or "content" not in m: raise HTTPException(status_code=400, detail=f'Each item in "{key}" must have role+content.') return key, msgs def _parse_fanout(self, model_input: Dict[str, Any]) -> Tuple[int, Optional[List[List[Message]]]]: """Parse and validate fan-out configuration.""" suffix = model_input.get("suffix_messages", None) if not isinstance(suffix, list) or any(not isinstance(t, list) for t in suffix): raise HTTPException(status_code=400, detail='"suffix_messages" must be a list of tasks (each task is a list of messages).') if len(suffix) < 1 or len(suffix) > 256: raise HTTPException(status_code=400, detail='"suffix_messages" must have between 1 and 256 tasks.') for task in suffix: for m in task: if not isinstance(m, dict) or "role" not in m or "content" not in m: raise HTTPException(status_code=400, detail="Each suffix message must have role+content.") return len(suffix), suffix def _to_openai_choices(self, payloads: List[Any]) -> Any: """Convert multiple payloads to OpenAI-style choices.""" base = payloads[0] if hasattr(base, "choices") and hasattr(base, "model_dump"): new_choices = [] for i, p in enumerate(payloads): c0 = p.choices[0] # Ensure index matches OpenAI n semantics try: c0.index = i except Exception: c0 = c0.model_copy(update={"index": i}) new_choices.append(c0) # Aggregate usage statistics base.usage.completion_tokens += p.usage.completion_tokens base.usage.prompt_tokens += p.usage.prompt_tokens base.usage.total_tokens += p.usage.total_tokens base.choices = new_choices return base raise HTTPException(status_code=500, detail=f"Unsupported engine response type for fanout. {type(base)}") async def chat_completions( # if you need to use /v1/completions use def completions(..) self, model_input: Dict[str, Any], request: Request, ) -> Any: # alias to predict, so that both /predict and (/sync)/v1/chat/completions work return await self.predict(model_input, request) ``` ### Fan-out generation configuration To deploy the above example, create a new directory, e.g. `fanout` and create a `fanout/model/model.py` file. Then create the following `config.yaml` at `fanout/config.yaml` ```yaml config.yaml theme={"system"} model_name: Multi-Generation-LLM resources: accelerator: H100 cpu: '2' memory: 20Gi use_gpu: true trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: "meta-llama/Llama-3.1-8B-Instruct" quantization_type: fp8 runtime: served_model_name: "Multi-Generation-LLM" ``` At last, push the model with `truss push --publish`. 
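Once the deployment is live, you can exercise the fan-out behavior by sending one base `messages` list plus a `suffix_messages` list of tasks. The sketch below is illustrative: the prompts are placeholders, the model ID must be filled in, the endpoint format follows the other inference examples in these docs, and the response handling assumes the OpenAI-style `choices` payload produced by `_to_openai_choices` above:

```python theme={"system"}
import os
import requests

# Replace with your deployed model ID
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

payload = {
    "stream": False,  # the example model.py rejects streaming requests
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet."},
    ],
    # One generation per task; each task's messages are appended to the base conversation.
    "suffix_messages": [
        [{"role": "user", "content": "Answer in one sentence."}],
        [{"role": "user", "content": "Answer as a haiku."}],
    ],
}

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=payload,
)

# Each suffix task contributes one choice to the aggregated response.
for choice in resp.json()["choices"]:
    print(choice["index"], choice["message"]["content"])
```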
## Limitations and considerations

### What custom engine builder cannot do

**Custom tokenization:**

* Cannot modify the underlying tokenizer implementation
* Cannot add custom vocabulary or special tokens
* Must use the model's native tokenization

**Model architecture changes:**

* Cannot modify the TensorRT-LLM engine structure
* Cannot change attention mechanisms or model layers
* Cannot add custom model components

### When to use standard engine instead

* Standard chat completions without special requirements
* No need for business logic integration

## Monitoring and debugging

### Request tracing

```python theme={"system"}
import os
import time
import uuid
from contextlib import asynccontextmanager
from typing import Any, Dict

from fastapi import Request


class Model:
    def __init__(self, trt_llm, **kwargs):
        self._engine = trt_llm["engine"]
        # Environment variables are strings, so parse the flag explicitly.
        self._trace_enabled = os.environ.get("enable_tracing", "true").lower() == "true"

    @asynccontextmanager
    async def _trace_request(self, request_id: str):
        """Context manager for request tracing."""
        if self._trace_enabled:
            print(f"[TRACE] Start: {request_id}")
        start_time = time.time()
        try:
            yield
        finally:
            if self._trace_enabled:
                duration = time.time() - start_time
                print(f"[TRACE] End: {request_id} (duration: {duration:.3f}s)")

    async def predict(self, model_input: Dict[str, Any], request: Request) -> Any:
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        async with self._trace_request(request_id):
            # Main logic here
            response = await self._engine.chat_completions(request=request, model_input=model_input)
            return response
```

## Further reading

* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Main engine documentation.
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete reference config.
* [Examples section](/examples/overview): Deployment examples.
* [Chains documentation](/development/chain/overview): Multi-model workflows.

---

# Source: https://docs.baseten.co/development/model/custom-health-checks.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Custom health checks

> Customize the health of your deployments.

**Why use custom health checks?**

* **Control traffic and restarts** by configuring failure thresholds to suit your needs.
* **Define replica health with custom logic** (e.g. fail after a certain number of 500s or a specific CUDA error).

By default, health checks run every 10 seconds to verify that each replica of your deployment is running successfully and can receive requests. If a health check fails for an extended period, one or both of the following actions may occur:

* Traffic is immediately stopped from reaching the failing replica.
* The failing replica is restarted.

The thresholds for each of these actions are configurable.

## Understanding readiness vs. liveness

Baseten uses two types of Kubernetes health probes that run continuously after your container starts:

**Readiness probe** answers "Can I handle requests right now?" When it fails, Kubernetes stops sending traffic to the container but doesn't restart it. Use this to prevent traffic during startup or temporary unavailability. The failure threshold is controlled by `stop_traffic_threshold_seconds`.

**Liveness probe** answers "Am I healthy enough to keep running?" When it fails, Kubernetes restarts the container. Use this to recover from deadlocks or hung processes. The failure threshold is controlled by `restart_threshold_seconds`.
For most servers, using the same endpoint (like `/health`) for both probes is sufficient. The key difference is the action taken: readiness controls traffic routing, while liveness controls container lifecycle.

Both probes wait before starting checks to allow your server time to initialize. Configure this delay with `restart_check_delay_seconds`.

Custom health checks can be implemented in two ways:

1. [**Configuring thresholds**](#configuring-health-checks) for when health check failures should stop traffic to or restart a replica.
2. [**Writing custom health check logic**](#writing-custom-health-checks) to define how replica health is determined.

## Configuring health checks

### Parameters

You can customize the behavior of health checks on your deployments by setting the following parameters:

* `stop_traffic_threshold_seconds`: The duration that health checks must continuously fail before traffic to the failing replica is stopped. Must be between `30` and `1800` seconds, inclusive.
* `restart_check_delay_seconds`: How long to wait before running health checks. Must be between `0` and `1800` seconds, inclusive.
* `restart_threshold_seconds`: The duration that health checks must continuously fail before triggering a restart of the failing replica. Must be between `30` and `1800` seconds, inclusive.

The combined value of `restart_check_delay_seconds` and `restart_threshold_seconds` must not exceed `1800` seconds.

### Model and custom server deployments

Configure health checks in your `config.yaml`.

```yaml config.yaml theme={"system"}
runtime:
  health_checks:
    restart_check_delay_seconds: 60 # Waits 60 seconds after deployment before starting health checks
    restart_threshold_seconds: 600 # Triggers a restart if health checks fail for 10 minutes
    stop_traffic_threshold_seconds: 300 # Stops traffic if health checks fail for 5 minutes
```

You can also specify custom health check endpoints for custom servers. [See here](/development/model/custom-server#1-configuring-a-custom-server-in-config-yaml) for more details.

### Chains

Use `remote_config` to configure health checks for your chainlet classes.

```python chain.py theme={"system"}
class CustomHealthChecks(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        options=chains.ChainletOptions(
            health_checks=truss_config.HealthChecks(
                restart_check_delay_seconds=30,  # Waits 30 seconds before starting health checks
                restart_threshold_seconds=600,  # Restart replicas after 10 minutes of failure
                stop_traffic_threshold_seconds=300,  # Stop traffic after 5 minutes of failure
            )
        )
    )
```

## Writing custom health checks

You can write custom health checks in both **model deployments** and **chain deployments**.

Custom health checks are currently not supported in development deployments.

### Custom health checks in models

```python model.py theme={"system"}
class Model:
    def is_healthy(self) -> bool:
        # Add custom health check logic for your model here
        pass
```

### Custom health checks in chains

Health checks can be customized for each chainlet in your chain.

```python chain.py theme={"system"}
@chains.mark_entrypoint
class CustomHealthChecks(chains.ChainletBase):
    def is_healthy(self) -> bool:
        # Add custom health check logic for your chainlet here
        pass
```

## Health checks in action

### Identifying 5xx errors

You might create a custom health check to identify 5xx errors like the following: ```python model.py theme={"system"} class Model: def __init__(self): ...
self._is_healthy = True def load(self): # Perform load # Your custom health check won't run until after load completes ... def is_healthy(self): return self._is_healthy def predict(self, input): try: # Perform inference ... except Some5xxError: self._is_healthy = False raise ``` Custom health check failures are indicated by the following log: ```md Example health check failure log line theme={"system"} Jan 27 10:36:03pm md2pg Health check failed. ``` Deployment restarts due to health check failures are indicated by the following log: ```md Example restart log line theme={"system"} Jan 27 12:02:47pm zgbmb Model terminated unexpectedly. Exit code: 0, reason: Completed, restart count: 1 ``` ## FAQs ### Is there a rule of thumb for configuring thresholds for stopping traffic and restarting? It depends on your health check implementation. If your health check relies on conditions that only change during inference (e.g., `_is_healthy` is set in `predict`), restarting before stopping traffic is generally better, as it allows recovery without disrupting traffic. Stopping traffic first may be preferable if a failing replica is actively degrading performance or causing inference errors, as it prevents the failing replica from affecting the overall deployment while allowing time for debugging or recovery. ### When should I configure `restart_check_delay_seconds`? Configure `restart_check_delay_seconds` to allow replicas sufficient time to initialize after deployment or a restart. This delay helps reduce unnecessary restarts, particularly for services with longer startup times. ### Why am I seeing two health check failure logs in my logs? These refer to two separate health checks we run every 10 seconds: * One to determine when to stop traffic to a replica. * The other to determine when to restart a replica. ### Does stopped traffic or replica restarts affect autoscaling? Yes, both can impact autoscaling. If traffic stops or replicas restart, the remaining replicas handle more load. If the load exceeds the concurrency target during the autoscaling window, additional replicas are spun up. Similarly, when traffic stabilizes, excess replicas are scaled down after the scale down delay. [See here](/deployment/autoscaling#autoscaling-behavior) for more details on autoscaling. ### How does billing get affected? You are billed for the uptime of your deployment. This includes the time a replica is running, even if it is failing health checks, until it scales down. ### Will failing health checks cause my deployment to stay up forever? No. If your deployment is configured with a scale down delay and the minimum number of replicas is set to 0, the replicas will scale down once the model is no longer receiving traffic for the duration of the scale down delay. This applies even if the replicas are failing health checks. [See here](/deployment/autoscaling#scale-to-zero) for more details on autoscaling. ### What happens when my deployment is loading? When your deployment is loading, your custom health check will not be running. Once `load()` is completed, we'll start using your custom `is_healthy()` health check. --- # Source: https://docs.baseten.co/development/model/custom-server.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deploy custom Docker images > Deploy custom Docker images to run inference servers like vLLM, SGLang, Triton, or any containerized application. 
When you write a `Model` class, Truss uses the [Truss server base image](https://hub.docker.com/r/baseten/truss-server-base/tags) by default. However, you can deploy pre-built containers. In this guide, you will learn how to set up your configuration file to run a custom Docker image and deploy it to Baseten using Truss.

## Configuration

To deploy a custom Docker image, set [`base_image`](/reference/truss-configuration#base-image-image) to your image and use the `docker_server` argument to specify how to run it.

```yaml config.yaml theme={"system"}
base_image:
  image: your-registry/your-image:latest
docker_server:
  start_command: your-server-start-command
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
```

* `image`: The Docker image to use.
* `start_command`: The command to start the server.
* `server_port`: The port to listen on.
* `predict_endpoint`: The endpoint to forward requests to.
* `readiness_endpoint`: The endpoint to check if the server is ready.
* `liveness_endpoint`: The endpoint to check if the server is alive.

Port 8080 is reserved by Baseten's internal reverse proxy. If your server binds to port 8080, the deployment fails with `[Errno 98] address already in use`.

For the full list of fields, see the [configuration reference](/reference/truss-configuration#docker_server).

While `predict_endpoint` maps your server's inference route to Baseten's `/predict` endpoint, you can access any route in your server using the [sync endpoint](/inference/calling-your-model#sync-api-endpoints).

| Baseten endpoint                             | Maps to                       |
| -------------------------------------------- | ----------------------------- |
| `/environments/production/predict`           | Your `predict_endpoint` route |
| `/environments/production/sync/{any/route}`  | `/{any/route}` in your server |

**Example:** If you set `predict_endpoint: /v1/chat/completions`:

| Baseten endpoint                           | Maps to                |
| ------------------------------------------ | ---------------------- |
| `/environments/production/predict`         | `/v1/chat/completions` |
| `/environments/production/sync/v1/models`  | `/v1/models`           |

## Deploy Ollama

This example deploys [Ollama](https://ollama.com/) with the TinyLlama model using a custom Docker image. Ollama is a popular lightweight LLM inference server, similar to vLLM or SGLang. TinyLlama is small enough to run on a CPU.

### 1. Create the config

Create a `config.yaml` file with the following configuration:

```yaml config.yaml theme={"system"}
model_name: ollama-tinyllama
base_image:
  image: python:3.11-slim
build_commands:
  - curl -fsSL https://ollama.com/install.sh | sh
docker_server:
  start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
  readiness_endpoint: /api/tags
  liveness_endpoint: /api/tags
  predict_endpoint: /api/generate
  server_port: 11434
resources:
  cpu: "4"
  memory: 8Gi
```

The `base_image` field specifies the Docker image to use as your starting point, in this case a lightweight Python image. The `build_commands` section installs Ollama into the container at build time. You can also use this to install model weights or other dependencies.

The `start_command` launches the Ollama server, waits for it to initialize, and then pulls the TinyLlama model. The `readiness_endpoint` and `liveness_endpoint` both point to `/api/tags`, which returns successfully when Ollama is running. The `predict_endpoint` maps Baseten's `/predict` route to Ollama's `/api/generate` endpoint.

Finally, declare your resource requirements.
This example only needs 4 CPUs and 8GB of memory. For a complete list of resource options, see the [Resources](/deployment/resources) page.

### 2. Deploy

To deploy the model, use the following:

```sh theme={"system"}
truss push --publish
```

This will build the Docker image and deploy it to Baseten. Once the `readiness_endpoint` and `liveness_endpoint` checks succeed, the model is ready to use.

### 3. Run inference

Ollama exposes an HTTP API for inference and uses its `/api/generate` endpoint to generate text. Since you mapped the `/predict` route to Ollama's `/api/generate` endpoint, you can run inference by calling the `/predict` endpoint.

To run inference with Truss, use the `predict` command:

```sh theme={"system"}
truss predict -d '{"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "options": {"num_predict": 50}}'
```

To run inference with cURL, use the following command:

```sh theme={"system"}
curl -s -X POST "https://model-MODEL_ID.api.baseten.co/development/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "options": {"num_predict": 50}}' \
  | jq -j '.response'
```

To run inference with Python, use the following:

```python theme={"system"}
import os
import requests

model_id = "MODEL_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

response = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model": "tinyllama",
        "prompt": "Write a short story about a robot dreaming",
        "options": {"num_predict": 50},
    },
)
print(response.json()["response"])
```

The following is an example of its response:

```output theme={"system"}
It was a dreary, grey day when the robots started to dream. They had been programmed to think like humans, but it wasn't until they began to dream that they realized just how far apart they actually were.
```

Congratulations! You have successfully deployed and run inference on a custom Docker image.

## Next steps

* [Private registries](/development/model/private-registries) — Pull images from AWS ECR, Google Artifact Registry, or Docker Hub
* [Secrets](/development/model/secrets#custom-docker-images) — Access API keys and tokens in your container
* [WebSockets](/development/model/websockets#websocket-usage-with-custom-servers) — Enable WebSocket connections
* [vLLM](/examples/vllm), [SGLang](/examples/sglang), [TensorRT-LLM](/examples/tensorrt-llm) — Deploy LLMs with popular inference servers

---

# Source: https://docs.baseten.co/development/model/data-directory.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Data and storage

> Load model weights without Hugging Face or S3

Model files, such as weights, can be **large** (often **multiple GBs**). Truss supports **multiple ways** to load them efficiently:

* **Public Hugging Face models** (default)
* **Bundled directly in Truss**

## 1. Bundling model weights in Truss

Store model files **inside Truss** using the `data/` directory.
**Example: Stable Diffusion 2.1 Truss structure**

```text theme={"system"}
data/
  scheduler/
    scheduler_config.json
  text_encoder/
    config.json
    diffusion_pytorch_model.bin
  tokenizer/
    merges.txt
    tokenizer_config.json
    vocab.json
  unet/
    config.json
    diffusion_pytorch_model.bin
  vae/
    config.json
    diffusion_pytorch_model.bin
  model_index.json
```

**Access bundled files in `model.py`:**

```python theme={"system"}
import torch
from diffusers import StableDiffusionPipeline


class Model:
    def __init__(self, **kwargs):
        self._data_dir = kwargs["data_dir"]

    def load(self):
        self.model = StableDiffusionPipeline.from_pretrained(
            str(self._data_dir),
            revision="fp16",
            torch_dtype=torch.float16,
        ).to("cuda")
```

Limitation: Large weights increase deployment size and slow down deployments. Consider cloud storage instead.

## 2. Loading private model weights from S3

If using **private S3 storage**, first **configure secure authentication**.

### Step 1: Define AWS secrets in `config.yaml`

```yaml theme={"system"}
secrets:
  aws_access_key_id: null
  aws_secret_access_key: null
  aws_region: null # e.g., us-east-1
  aws_bucket: null
```

Do not store actual credentials here. Add them securely to [Baseten secrets manager](https://app.baseten.co/settings/secrets).

### Step 2: Authenticate with AWS in `model.py`

```python theme={"system"}
import boto3


class Model:
    def __init__(self, **kwargs):
        self._config = kwargs.get("config")
        secrets = kwargs.get("secrets")
        self.s3_client = boto3.client(
            "s3",
            aws_access_key_id=secrets["aws_access_key_id"],
            aws_secret_access_key=secrets["aws_secret_access_key"],
            region_name=secrets["aws_region"],
        )
        self.s3_bucket = secrets["aws_bucket"]
```

### Step 3: Deploy

Deploy for development:

```sh theme={"system"}
truss push --watch
```

Or deploy for production:

```sh theme={"system"}
truss push --publish
```

---

# Source: https://docs.baseten.co/observability/export-metrics/datadog.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Export to Datadog

> Export metrics from Baseten to Datadog

The Baseten metrics endpoint can be integrated with [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) by configuring a Prometheus receiver that scrapes the endpoint. This allows Baseten metrics to be pushed to a variety of popular exporters—see the [OpenTelemetry registry](https://opentelemetry.io/ecosystem/registry/?component=exporter) for a full list.

**Using OpenTelemetry Collector to push to Datadog**

```yaml config.yaml theme={"system"}
receivers:
  # Configure a Prometheus receiver to scrape the Baseten metrics endpoint.
  prometheus:
    config:
      scrape_configs:
        - job_name: 'baseten'
          scrape_interval: 60s
          metrics_path: '/metrics'
          scheme: https
          authorization:
            type: "Api-Key"
            credentials: "{BASETEN_API_KEY}"
          static_configs:
            - targets: ['app.baseten.co']
processors:
  batch:
exporters:
  # Configure a Datadog exporter.
  datadog:
    api:
      key: "{DATADOG_API_KEY}"
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
```

---

# Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-deployment-associated-with-an-environment.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Deactivate environment deployment

> Deactivates a deployment associated with an environment and returns the deactivation status.
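As a quick illustration of the spec below, a request might look like the following sketch; the model ID and environment name are placeholders.

```python theme={"system"}
import os
import requests

model_id = "MODEL_ID"    # placeholder
env_name = "production"  # placeholder environment name

resp = requests.post(
    f"https://api.baseten.co/v1/models/{model_id}/environments/{env_name}/deactivate",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
)
print(resp.json())  # e.g. {"success": true}
```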
## OpenAPI ````yaml post /v1/models/{model_id}/environments/{env_name}/deactivate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments/{env_name}/deactivate: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/env_name' post: summary: Deactivates a deployment associated with an environment description: >- Deactivates a deployment associated with an environment and returns the deactivation status. responses: '200': description: The response to a request to deactivate a deployment. content: application/json: schema: $ref: '#/components/schemas/DeactivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true env_name: schema: type: string name: env_name in: path required: true schemas: DeactivateResponseV1: description: The response to a request to deactivate a deployment. properties: success: default: true description: Whether the deployment was successfully deactivated title: Success type: boolean title: DeactivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Any deployment by ID > Deactivates a deployment and returns the deactivation status. ## OpenAPI ````yaml post /v1/models/{model_id}/deployments/{deployment_id}/deactivate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/{deployment_id}/deactivate: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/deployment_id' post: summary: Deactivates a deployment description: Deactivates a deployment and returns the deactivation status. responses: '200': description: The response to a request to deactivate a deployment. content: application/json: schema: $ref: '#/components/schemas/DeactivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true deployment_id: schema: type: string name: deployment_id in: path required: true schemas: DeactivateResponseV1: description: The response to a request to deactivate a deployment. properties: success: default: true description: Whether the deployment was successfully deactivated title: Success type: boolean title: DeactivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-development-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
# Development deployment > Deactivates a development deployment and returns the deactivation status. ## OpenAPI ````yaml post /v1/models/{model_id}/deployments/development/deactivate openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/development/deactivate: parameters: - $ref: '#/components/parameters/model_id' post: summary: Deactivates a development deployment description: >- Deactivates a development deployment and returns the deactivation status. responses: '200': description: The response to a request to deactivate a deployment. content: application/json: schema: $ref: '#/components/schemas/DeactivateResponseV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: DeactivateResponseV1: description: The response to a request to deactivate a deployment. properties: success: default: true description: Whether the deployment was successfully deactivated title: Success type: boolean title: DeactivateResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/examples/models/deepseek/deepseek-r1-qwen-7b.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # DeepSeek-R1 Qwen 7B > Qwen 7B fine-tuned for CoT reasoning capabilities with DeepSeek R1 export const DeepSeekIconCard = ({title, href}) => } horizontal />; # Example usage The fine-tuned version of Qwen is OpenAI compatible and can be called using the OpenAI client. ```python theme={"system"} import os from openai import OpenAI # https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1 model_url = "" client = OpenAI( base_url=model_url, api_key=os.environ.get("BASETEN_API_KEY"), ) stream = client.chat.completions.create( model="baseten", messages=[ {"role": "user", "content": "Which weighs more, a pound of bricks or a pound of feathers?"}, ], stream=True, ) for chunk in stream: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="") ``` # JSON output ```json theme={"system"} ["streaming", "output", "text"] ``` --- # Source: https://docs.baseten.co/examples/models/deepseek/deepseek-r1.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deepseek R1 > A state-of-the-art 671B-parameter MoE LLM with o1-style reasoning licensed for commercial use export const DeepSeekIconCard = ({title, href}) => } horizontal />; # Example usage DeepSeek-R1 is optimized using SGLang and uses an OpenAI-compatible API endpoint. 
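Because the endpoint is OpenAI-compatible, you can also call it with the OpenAI client, mirroring the DeepSeek-R1 Qwen 7B example above. The sketch below assumes the `sync/v1` base URL pattern and the `deepseek_v3` model name used in the httpx example that follows; replace `MODEL_ID` with your own model ID.

```python theme={"system"}
import os
from openai import OpenAI

# Assumed base URL pattern; replace MODEL_ID with your deployment's model ID.
client = OpenAI(
    base_url="https://model-MODEL_ID.api.baseten.co/environments/production/sync/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek_v3",  # model name used in the httpx example below
    messages=[
        {"role": "user", "content": "What weighs more, a pound of bricks or a pound of feathers?"},
    ],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```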
## Input ```python theme={"system"} import httpx import os MODEL_ID = "abcd1234" # Replace this with your model ID DEPLOYMENT_ID = "abcd1234" # [Optional] Replace this with your deployment ID API_KEY = os.environ["BASETEN_API_KEY"] resp = httpx.post( f"https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1/chat/completions", headers={"Authorization": f"Api-Key {API_KEY}"}, json={ "model": "deepseek_v3", "messages": [ {"role": "system", "content": "You are a helpful AI assistant."}, {"role": "user", "content": "What weighs more, a pound of bricks or a pound of feathers?"}, ], "max_tokens": 1024, }, timeout=None ) print(resp.json()) ``` ## Output ```json theme={"system"} { "id": "8456fe51db3548789f199cfb8c8efd35", "object": "text_completion", "created": 1735236968, "model": "/models/deepseek_r1", "choices": [ { "index": 0, "text": "Let's think through this step by step...", "logprobs": null, "finish_reason": "stop", "matched_stop": 1 } ], "usage": { "prompt_tokens": 14, "total_tokens": 240, "completion_tokens": 226, "prompt_tokens_details": null } } ``` --- # Source: https://docs.baseten.co/reference/management-api/api-keys/delete-an-api-key.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Delete an API key > Deletes an API key by prefix and returns info about the API key. ## OpenAPI ````yaml delete /v1/api_keys/{api_key_prefix} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/api_keys/{api_key_prefix}: parameters: - $ref: '#/components/parameters/api_key_prefix' delete: summary: Deletes an API key by prefix description: Deletes an API key by prefix and returns info about the API key. responses: '200': description: An API key tombstone. content: application/json: schema: $ref: '#/components/schemas/APIKeyTombstoneV1' components: parameters: api_key_prefix: schema: type: string name: api_key_prefix in: path required: true schemas: APIKeyTombstoneV1: description: An API key tombstone. properties: prefix: description: Unique prefix of the API key title: Prefix type: string required: - prefix title: APIKeyTombstoneV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/chains/deletes-a-chain-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Delete chains ## OpenAPI ````yaml delete /v1/chains/{chain_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}: parameters: - $ref: '#/components/parameters/chain_id' delete: summary: Deletes a chain by ID responses: '200': description: A chain tombstone. content: application/json: schema: $ref: '#/components/schemas/ChainTombstoneV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true schemas: ChainTombstoneV1: description: A chain tombstone. 
properties: id: description: Unique identifier of the chain title: Id type: string deleted: description: Whether the chain was deleted title: Deleted type: boolean required: - id - deleted title: ChainTombstoneV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/deletes-a-chain-deployment-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Delete chain deployment ## OpenAPI ````yaml delete /v1/chains/{chain_id}/deployments/{chain_deployment_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/deployments/{chain_deployment_id}: parameters: - $ref: '#/components/parameters/chain_id' - $ref: '#/components/parameters/chain_deployment_id' delete: summary: Deletes a chain deployment by ID responses: '200': description: A chain deployment tombstone. content: application/json: schema: $ref: '#/components/schemas/ChainDeploymentTombstoneV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true chain_deployment_id: schema: type: string name: chain_deployment_id in: path required: true schemas: ChainDeploymentTombstoneV1: description: A chain deployment tombstone. properties: id: description: Unique identifier of the chain deployment title: Id type: string deleted: description: Whether the chain deployment was deleted title: Deleted type: boolean chain_id: description: Unique identifier of the chain title: Chain Id type: string required: - id - deleted - chain_id title: ChainDeploymentTombstoneV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/models/deletes-a-model-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Delete models ## OpenAPI ````yaml delete /v1/models/{model_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}: parameters: - $ref: '#/components/parameters/model_id' delete: summary: Deletes a model by ID responses: '200': description: A model tombstone. content: application/json: schema: $ref: '#/components/schemas/ModelTombstoneV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: ModelTombstoneV1: description: A model tombstone. properties: id: description: Unique identifier of the model title: Id type: string deleted: description: Whether the model was deleted title: Deleted type: boolean required: - id - deleted title: ModelTombstoneV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/deletes-a-models-deployment-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Delete model deployments > Deletes a model's deployment by ID and returns the tombstone of the deployment. ## OpenAPI ````yaml delete /v1/models/{model_id}/deployments/{deployment_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/{deployment_id}: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/deployment_id' delete: summary: Deletes a model's deployment by ID description: >- Deletes a model's deployment by ID and returns the tombstone of the deployment. responses: '200': description: A model deployment tombstone. content: application/json: schema: $ref: '#/components/schemas/DeploymentTombstoneV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true deployment_id: schema: type: string name: deployment_id in: path required: true schemas: DeploymentTombstoneV1: description: A model deployment tombstone. properties: id: description: Unique identifier of the deployment title: Id type: string deleted: description: Whether the deployment was deleted title: Deleted type: boolean model_id: description: Unique identifier of the model title: Model Id type: string required: - id - deleted - model_id title: DeploymentTombstoneV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/development/model/deploy-and-iterate.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deploy and iterate > Deploy your model and quickly iterate on it. In [Your First Model](/development/model/build-your-first-model), we walked through how to deploy a basic model to Baseten. If you are trying to rapidly make changes and iterate on your model, you'll notice that there is quite a bit of time between running `truss push --publish` and when the changes are reflected on Baseten. Also, a lot of models require special hardware that you may not immediately have access to. To solve this problem, we have a feature called **Truss Watch**, that allows you to live reload your model as you work. # Truss Watch To make use of `truss watch`, start by deploying your model as a development deployment: ```bash theme={"system"} $ truss push --watch ``` This will deploy a "development" version of your model with live reload enabled. The model has a live reload server attached to it and supports hot reloading. To continue the hot reload loop, simply run `truss watch` afterwards: ```bash theme={"system"} $ truss watch ``` Now, if you make changes to your model, you'll see them reflected in the model logs! You can now happily iterate on your model without having to go through the entire build & deploy loop between each change. # Ready for Production? 
Once you've iterated on your model and you're ready to deploy it to production, use the `truss push --publish` command. This deploys a "published" version of your model:

```bash theme={"system"}
truss push --publish
```

Development deployments have slightly lower performance and more limited scaling properties, so we recommend not using them for any production use case.

---

# Source: https://docs.baseten.co/examples/deploy-your-first-model.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Deploy your first model

> Learn how to package and deploy an AI model as a production-ready API endpoint on Baseten.

Deploying a model to Baseten turns your model code into a production-ready API endpoint. You package your model with [Truss](https://pypi.org/project/truss/), push it to Baseten, and receive a URL you can call from any application.

This guide walks through deploying [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), a 3.8B parameter LLM, from local code to a production API. You'll create a Truss project, write model code, configure dependencies and GPU resources, deploy to Baseten, and call your model's API endpoint.

## Set up your environment

Before you begin, [sign up](https://app.baseten.co/signup) or [sign in](https://app.baseten.co/login) to Baseten.

### Install Truss

[Truss](https://pypi.org/project/truss/) is Baseten's model packaging framework. It handles containerization, dependencies, and deployment configuration.

Using a virtual environment is recommended to avoid dependency conflicts with other Python projects. [uv](https://docs.astral.sh/uv/) is a fast Python package manager.

These commands create a virtual environment, activate it, and install Truss:

```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss
```

These commands create a virtual environment, activate it, and install Truss:

```sh theme={"system"}
python -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```

These commands create a virtual environment, activate it, and install Truss:

```sh theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```

New accounts include free credits; this guide should use less than \$1 in GPU costs.

***

## Create a Truss

A **Truss** packages your model into a deployable container with all dependencies and configurations.

Create a new Truss:

```sh theme={"system"}
truss init phi-3-mini && cd phi-3-mini
```

When prompted, give your Truss a name like `Phi 3 Mini`.

This command scaffolds a project with the following structure:

```
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
```

The key files are:

* `model/model.py`: Your model code with `load()` and `predict()` methods.
* `config.yaml`: Dependencies, resources, and deployment settings.
* `data/`: Optional directory for data files bundled with your model.
* `packages/`: Optional directory for local Python packages.

Truss uses this structure to build and deploy your model automatically. You define your model in `model.py` and your infrastructure in `config.yaml`; no Dockerfiles or container management required.

***

## Implement model code

In this example, you'll implement the model code for [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).
You'll use the `transformers` library to load the model and tokenizer and PyTorch to run inference. Replace the contents of `model/model.py` with the following code: ```python model/model.py theme={"system"} import torch from transformers import AutoModelForCausalLM, AutoTokenizer class Model: def __init__(self, **kwargs): self._model = None self._tokenizer = None def load(self): self._model = AutoModelForCausalLM.from_pretrained( "microsoft/Phi-3-mini-4k-instruct", device_map="cuda", torch_dtype="auto" ) self._tokenizer = AutoTokenizer.from_pretrained( "microsoft/Phi-3-mini-4k-instruct" ) def predict(self, request): messages = request.pop("messages") model_inputs = self._tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda") with torch.no_grad(): outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256) return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)} ``` Truss models follow a three-method pattern that separates initialization from inference: | Method | When it's called | What to do here | | ---------- | ------------------------------------ | --------------------------------------------------------- | | `__init__` | Once when the class is created | Initialize variables, store configuration, set secrets | | `load` | Once at startup, before any requests | Load model weights, tokenizers, and other heavy resources | | `predict` | On every API request | Process input, run inference, return response | **Why separate `load` from `__init__`?** The `load` method runs during the container's cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path. ### Understand the request/response flow The `predict` method receives `request`, a dictionary containing the JSON body from the API call: ```python theme={"system"} # API call with: {"messages": [{"role": "user", "content": "Hello"}]} def predict(self, request): messages = request.pop("messages") # Extract from request # ... run inference ... return {"output": result} # Return dict becomes JSON response ``` Whatever dictionary you return becomes the API response. You control the input parameters and output format. ### GPU and memory patterns A few patterns in this code are common across GPU models: * **`device_map="cuda"`**: Loads model weights directly to GPU. * **`.to("cuda")`**: Moves input tensors to GPU for inference. * **`torch.no_grad()`**: Disables gradient tracking to save memory (gradients aren't needed for inference). *** ## Configure dependencies and GPU The `config.yaml` file defines your model's environment and compute resources. This configuration determines how your container is built and what hardware it runs on. 
### Set Python version and dependencies

```yaml config.yaml theme={"system"}
python_version: py311
requirements:
  - six==1.17.0
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0
```

**Key configuration options:**

| Field             | Purpose                                   | Example                           |
| ----------------- | ----------------------------------------- | --------------------------------- |
| `python_version`  | Python version for your container         | `py39`, `py310`, `py311`, `py312` |
| `requirements`    | Python packages to install (pip format)   | `torch==2.3.0`                    |
| `system_packages` | System-level dependencies (apt packages)  | `ffmpeg`, `libsm6`                |

For the complete list of configuration options, see the [Truss reference config](/reference/truss-configuration).

Always pin exact versions (e.g., `torch==2.3.0` not `torch>=2.0`). This ensures reproducible builds so your model behaves the same way every time it's deployed.

### Allocate a GPU

The `resources` section specifies what hardware your model runs on:

```yaml config.yaml theme={"system"}
resources:
  accelerator: T4
  use_gpu: true
```

**Choosing the right GPU:** Match your GPU to your model's VRAM requirements. For Phi-3-mini (\~7.6GB), a T4 (16GB) provides headroom for inference.

| GPU  | VRAM    | Good for                                    |
| ---- | ------- | ------------------------------------------- |
| T4   | 16GB    | Small models, embeddings, fine-tuned models |
| L4   | 24GB    | Medium models (7B parameters)               |
| A10G | 24GB    | Medium models, image generation             |
| A100 | 40/80GB | Large models (13B-70B parameters)           |
| H100 | 80GB    | Very large models, high throughput          |

**Estimating VRAM:** A rough rule is 2GB of VRAM per billion parameters for float16 models. A 7B model needs \~14GB VRAM minimum.

***

## Deploy the model

### Authenticate with Baseten

First, generate an API key from the [Baseten settings](https://app.baseten.co/settings/account/api_keys). Then log in:

```sh theme={"system"}
truss login
```

The expected output is:

```output theme={"system"}
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
```

Paste your API key when prompted. Truss saves your credentials for future deployments.

### Push your model to Baseten

For development with live reload:

```sh theme={"system"}
truss push --watch
```

The expected output is:

```output theme={"system"}
Deploying truss using T4x4x16 instance type.
✨ Model Phi 3 Mini was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
```

When no flag is specified, `truss push` defaults to a published deployment. Use `--watch` for development deployments with live reload support.

In this example, the logs URL contains two IDs:

* **Model ID**: The string after `/models/` (e.g., `abc1d2ef`), which you'll use to call the model API.
* **Deployment ID**: The string after `/logs/` (e.g., `xyz123`), which identifies this specific deployment.

You can also find your model ID in [your Baseten dashboard](https://app.baseten.co/models/) by clicking on your model.

***

## Call the model API

After the deployment is complete, you can call the model API:

From your Truss project directory, run:

```sh theme={"system"}
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
```

The expected output is:

```output theme={"system"}
Calling predict on development deployment...
{
  "output": "AGI stands for Artificial General Intelligence..."
}
```

The Truss CLI uses your saved credentials and automatically targets the correct deployment.
Set your API key and replace `YOUR_MODEL_ID` with your model ID (e.g., `abc1d2ef`): ```sh theme={"system"} export BASETEN_API_KEY=YOUR_API_KEY curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/development/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -H "Content-Type: application/json" \ -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}' ``` The expected output is: ```output theme={"system"} {'output': 'AGI stands for Artificial General Intelligence...'} ``` Set your API key as an environment variable, then replace `YOUR_MODEL_ID` with your model ID: ```sh theme={"system"} export BASETEN_API_KEY=YOUR_API_KEY ``` ```python main.py theme={"system"} import requests import os model_id = "YOUR_MODEL_ID" # Replace with your model ID (e.g., "abc1d2ef") baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.post( f"https://model-{model_id}.api.baseten.co/development/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "messages": [ {"role": "user", "content": "What is AGI?"} ] } ) print(resp.json()) ``` The expected output is: ```output theme={"system"} {'output': 'AGI stands for Artificial General Intelligence...'} ``` *** ## Use live reload for development To avoid long deploy times when testing changes, use **live reload**: ```sh theme={"system"} truss watch ``` The expected output is: ```output theme={"system"} 🪵 View logs for your deployment at https://app.baseten.co/models//logs/ 🚰 Attempting to sync truss with remote No changes observed, skipping patching. 👀 Watching for changes to truss... ``` When you save changes to `model.py`, Truss automatically patches the deployed model: ```output theme={"system"} Changes detected, creating patch... Created patch to update model code file: model/model.py Model Phi 3 Mini patched successfully. ``` This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server. *** ## Promote to production Once you're happy with the model, deploy it to production: ```sh theme={"system"} truss push --publish ``` This changes the API endpoint from `/development/predict` to `/production/predict`: ```sh theme={"system"} curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -H "Content-Type: application/json" \ -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}' ``` To call your production endpoint, you need your model ID. The output of `truss push --publish` includes a logs URL: ```output theme={"system"} 🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123 ``` Your model ID is the string after `/models/` (e.g., `abc1d2ef`). You can also find it in your [Baseten dashboard](https://app.baseten.co/models/). *** ## Next steps Now that you've deployed your first model, continue learning: * [Model serving with Truss](/development/model/overview): Configure dependencies, secrets, and resources. * [Example implementations](https://github.com/basetenlabs/truss-examples): Deploy dozens of open source models. * [Autoscaling settings](/deployment/autoscaling): Scale GPU replicas based on demand. --- # Source: https://docs.baseten.co/development/chain/deploy.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deploy > Deploy your Chain on Baseten Deploying a Chain is an atomic action that deploys every Chainlet within the Chain. 
Each Chainlet specifies its own remote environment — hardware resources, Python and system dependencies, autoscaling settings. ### Development The default behavior for pushing a chain is to create a development deployment: ```sh theme={"system"} truss chains push ./my_chain.py ``` Where `my_chain.py` contains the entrypoint Chainlet for your Chain. Development deployments are intended for testing and can't scale past one replica. Each time you make a development deployment, it overwrites the existing development deployment. Development deployments support rapid iteration with `watch` - see [above guide](/development/chain/watch). ### 🆕 Environments To deploy a Chain to an environment, run: ```sh theme={"system"} truss chains push ./my_chain.py --environment {env_name} ``` Environments are intended for live traffic and have access to full autoscaling settings. Each time you deploy to an environment, a new deployment is created. Once the new deployment is live, it replaces the previous deployment, which is relegated to the published deployments list. [Learn more](/deployment/environments) about environments. --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-async-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async deployment > Use this endpoint to call any [published deployment](/deploy/lifecycle) of your model. ### Parameters The ID of the model you want to call. The ID of the specific deployment you want to call. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body There is a 256 KiB size limit to `/async_predict` request payloads. JSON-serializable model input. Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later. URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported. Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1). `priority` is between 0 and 2, inclusive. Maximum time a request will spend in the queue before expiring. `max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive. Exponential backoff parameters used to retry the model predict request. Number of predict request attempts. `max_attempts` must be between 1 and 10, inclusive. Minimum time between retries in milliseconds. `initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive. Maximum time between retries in milliseconds. `max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive. ### Response The ID of the async request. ### Rate limits Two types of rate limits apply when making async requests: * Calls to the `/async_predict` endpoint are limited to **200 requests per second**. * Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments. If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code. To avoid hitting these rate limits, we advise: * Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors. * Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). 
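For example, here is a minimal client-side backoff sketch (not from the original docs) that retries `/async_predict` when the queue or rate limit is saturated; the model ID, deployment ID, payload, and retry limits are illustrative.

```python theme={"system"}
import os
import time
import requests

model_id = "MODEL_ID"            # placeholder
deployment_id = "DEPLOYMENT_ID"  # placeholder
url = f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict"
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
payload = {
    "model_input": {"prompt": "hello world!"},
    "webhook_endpoint": "https://my_webhook.com/webhook",
}

delay = 1.0
for attempt in range(5):  # illustrative retry limit
    resp = requests.post(url, headers=headers, json=payload)
    if resp.status_code != 429:
        print(resp.json())
        break
    # 429 means a rate limit was hit; back off exponentially before retrying.
    time.sleep(delay)
    delay *= 2
```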
If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings. ```python Python theme={"system"} import requests import os model_id = "" deployment_id = "" webhook_endpoint = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.post( f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": webhook_endpoint # Optional fields for priority, max_time_in_queue_seconds, etc }, ) print(resp.json()) ``` ```sh cURL theme={"system"} curl --request POST \ --url https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict \ --header "Authorization: Api-Key $BASETEN_API_KEY" \ --data '{ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": "https://my_webhook.com/webhook", "priority": 1, "max_time_in_queue_seconds": 100, "inference_retry_config": { "max_attempts": 3, "initial_delay_ms": 1000, "max_delay_ms": 5000 } }' ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({ model_input: { prompt: "hello world!" }, webhook_endpoint: "https://my_webhook.com/webhook", priority: 1, max_time_in_queue_seconds: 100, inference_retry_config: { max_attempts: 3, initial_delay_ms: 1000, max_delay_ms: 5000, }, }), } ); const data = await resp.json(); console.log(data); ``` ```json 201 theme={"system"} { "request_id": "" } ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-async-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async chains deployment Use this endpoint to call any [deployment](/deployment/deployments) of your chain asynchronously. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote ``` ### Parameters The ID of the chain you want to call. The ID of the specific deployment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e. The top-level keys are the argument names. The values are the corresponding JSON representation of the types. 
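For instance, if your entrypoint's `run_remote` were defined with the hypothetical signature sketched below, the request body would carry one top-level key per argument:

```python theme={"system"}
# Hypothetical entrypoint signature (illustrative, not from the docs):
#   async def run_remote(self, text: str, num_variants: int) -> str: ...
#
# The corresponding JSON-serializable chain input would be:
chain_input = {
    "text": "hello world!",
    "num_variants": 3,
}
```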
```python Python theme={"system"} import urllib3 import os chain_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json 201 theme={"system"} { "request_id": "" } ``` --- # Source: https://docs.baseten.co/engines/performance-concepts/deployment-from-training-and-s3.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deploy training and S3 checkpoints > Deploy training checkpoints and cloud storage models with TensorRT-LLM optimization. Deploy training checkpoints and cloud storage models with Engine-Builder-LLM, BEI, or BIS-LLM. ## Training checkpoint deployment Deploy fine-tuned models from Baseten Training with Engine-Builder-LLM. Specify `BASETEN_TRAINING` as the source: ```yaml config.yaml theme={"system"} model_name: My Fine-Tuned LLM resources: accelerator: H100:1 use_gpu: true secrets: hf_access_token: null # do not set value here trt_llm: build: base_model: decoder checkpoint_repository: source: BASETEN_TRAINING repo: YOUR_TRAINING_JOB_ID revision: checkpoint-100 ``` **Key fields:** * `base_model`: `decoder` for LLMs, `encoder`/`encoder_bert` for embeddings * `source`: `BASETEN_TRAINING` for Baseten Training checkpoints * `repo`: Your training job ID * `revision`: Checkpoint folder name (e.g., `checkpoint-100`, `checkpoint-final`) Find your checkpoint details with: ```sh theme={"system"} truss train get_checkpoint_urls --job-id=YOUR_TRAINING_JOB_ID ``` ### Encoder model requirements To deploy a fine-tuned encoder model (embeddings, rerankers) from training checkpoints, use `encoder` or `encoder_bert` as the base model: ```yaml config.yaml theme={"system"} model_name: My Fine-Tuned Embeddings resources: accelerator: L4:1 use_gpu: true trt_llm: build: base_model: encoder_bert checkpoint_repository: source: BASETEN_TRAINING repo: YOUR_TRAINING_JOB_ID revision: checkpoint-final runtime: webserver_default_route: /v1/embeddings ``` Use `encoder_bert` for BERT-based models (sentence-transformers, classification, reranking). Use `encoder` for causal embedding models. Encoder models have specific requirements: * **No tensor parallelism**: Omit `tensor_parallel_count` or set it to `1`. * **Fast tokenizer required**: Your checkpoint must include a `tokenizer.json` file. Models using only the legacy `vocab.txt` format are not supported. * **Embedding model files**: For sentence-transformer models, include `modules.json` and `1_Pooling/config.json` in your checkpoint. The `webserver_default_route` configures the inference endpoint.
Options include `/v1/embeddings` for embeddings, `/rerank` for rerankers, and `/predict` for classification. ## Cloud storage deployment Deploy models directly from S3, GCS, or Azure. Specify the storage source and bucket path: ```yaml config.yaml theme={"system"} trt_llm: build: base_model: decoder checkpoint_repository: source: S3 # or GCS, AZURE, HF repo: s3://your-bucket/path/to/model/ ``` **Storage sources:** * `S3`: Amazon S3 buckets * `GCS`: Google Cloud Storage * `AZURE`: Azure Blob Storage * `HF`: Hugging Face repositories ### Private storage setup All runtimes use the same downloader system as [model\_cache](/development/model/model-cache). As a result, you configure the `runtime_secret_name` and `repo` identically across model\_cache and runtimes like Engine-Builder-LLM or BEI. **Secret Setup:** Add these JSON secrets to your [Baseten secrets manager](https://app.baseten.co/settings/secrets). For more details, refer to the documentation in [model\_cache](/development/model/model-cache). **S3:** ```json theme={"system"} { "access_key_id": "XXXXX", "secret_access_key": "xxxxx/xxxxxx", "region": "us-west-2" } ``` **GCS:** ```json theme={"system"} { "private_key_id": "xxxxxxx", "private_key": "-----BEGIN PRIVATE KEY-----\nMI", "client_email": "b10-some@xxx-example.iam.gserviceaccount.com" } ``` **Azure:** ```json theme={"system"} { "account_key": "xxxxx" } ``` Reference the secret in your config: ```yaml theme={"system"} secrets: aws_secret_json: "set token in baseten workspace" trt_llm: build: checkpoint_repository: source: S3 repo: s3://your-private-bucket/model runtime_secret_name: aws_secret_json ``` **For Baseten Training deployments:** These secrets are automatically mounted and available to your deployment. ## Further reading * [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete build and runtime options for LLMs. * [BEI reference configuration](/engines/bei/bei-reference): Complete configuration for encoder models. * [Model cache documentation](/development/model/model-cache): Caching strategies used by the engines. * [Secrets management](/development/model/secrets): Configure credentials for private storage. --- # Source: https://docs.baseten.co/reference/inference-api/status-endpoints/deployment-get-async-queue-status.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async deployment > Use this endpoint to get the status of a published deployment's async queue. ### Parameters The ID of the model. The ID of the chain. The ID of the deployment. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Response The ID of the model. The ID of the deployment. The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model). The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model). ```json 200 theme={"system"} { "model_id": "", "deployment_id": "", "num_queued_requests": 12, "num_in_progress_requests": 3 } ``` ### Rate limits Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code. 
To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors. ```py Model theme={"system"} import requests import os model_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` ```py Chain theme={"system"} import requests import os chain_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deployment Use this endpoint to call any [published deployment](/deployment/deployments) of your model. ```sh theme={"system"} https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict ``` ### Parameters The ID of the model you want to call. The ID of the specific deployment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable model input. ```python Python theme={"system"} import urllib3 import os model_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable model input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable model input ``` ```sh Truss theme={"system"} truss predict --model-version DEPLOYMENT_ID -d '{}' # JSON-serializable model input ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({}), // JSON-serializable model input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response // JSON-serializable output varies by model theme={"system"} {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Chains deployment Use this endpoint to call any [deployment](/deployment/deployments) of your chain. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote ``` ### Parameters The ID of the chain you want to call. The ID of the specific deployment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. 
`{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e. The top-level keys are the argument names. The values are the corresponding JSON representation of the types. ```python Python theme={"system"} import urllib3 import os chain_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain -{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response // JSON-serializable output varies by chain theme={"system"} {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/wake/deployment-wake.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deployment Use this endpoint to wake any scaled-to-zero [deployment](/deployment/deployments) of your model. ```sh theme={"system"} https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake ``` ### Parameters The ID of the model you want to wake. The ID of the specific deployment you want to wake. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ```python Python theme={"system"} import urllib3 import os model_id = "" deployment_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake", headers={"Authorization": f"Api-Key {baseten_api_key}"}, ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake \ -H 'Authorization: Api-Key YOUR_API_KEY' \ ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, } ); const data = await resp.json(); console.log(data); ``` ```json Example Response // Returns a 202 response code theme={"system"} {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-websocket.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Websocket deployment Use this endpoint to connect via WebSockets to a specific deployment. Note that `entity` here could be either `model` or `chain`, depending on whether you using Baseten models or Chains. 
```sh theme={"system"} wss://{entity}-{entity_id}.api.baseten.co/deployment/{deployment_id}/websocket" ``` See [WebSockets](/development/model/websockets) for more details. ### Parameters The type of entity you want to connect to. Either `model` or `chain`. The ID of the model or chain you want to connect to. The ID of the deployment you want to connect to. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ```sh websocat theme={"system"} websocat -H 'Authorization: Api-Key YOUR_API_KEY' \ wss://{entity}-{model_id}.api.baseten.co/deployment/{deployment_id}/websocket ``` --- # Source: https://docs.baseten.co/training/deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Serving your trained model > How to deploy checkpoints from Baseten Training jobs as usable models. Baseten Training seamlessly integrates with Baseten's model deployment capabilities. Once your `TrainingJob` has produced model checkpoints, you can deploy them as fully operational model endpoints. **This feature works with HuggingFace compatible LLMs**, allowing you to easily deploy fine-tuned language models directly from your training checkpoints with a single command. For optimized inference performance with TensorRT-LLM, BEI and Baseten Inference Stack, see [Deploy checkpoints with Engine Builder](/engines/performance-concepts/deployment-from-training-and-s3). To leverage deploying checkpoints, first ensure you have a `TrainingJob` that's running with a `checkpointing_config` enabled. ```python theme={"system"} runtime = definitions.Runtime( start_commands=[ "/bin/sh -c './run.sh'", ], checkpointing_config=definitions.CheckpointingConfig( enabled=True, ), ) ``` In your training code or configuration, ensure that your checkpoints are being written to the checkpointing directory, which can be referenced via [`$BT_CHECKPOINT_DIR`](/reference/sdk/training#baseten-provided-environment-variables). The contents of this directory are uploaded to Baseten's storage and made immediately available for deployment. *(You can optionally specify a `checkpoint_path` in your `checkpointing_config` if you prefer to write to a specific directory).* The default location is "/tmp/training\_checkpoints". To deploy your checkpoint(s) as a `Deployment`, you can: ### CLI Deployment ```bash theme={"system"} truss train deploy_checkpoints [OPTIONS] ``` **Options:** * `--job-id` (TEXT): Job ID to deploy checkpoints from. If not specified, deploys from the most recent training job. This will deploy the most recent checkpoint from your training job as an inference endpoint. ### UI Deployment You can also deploy checkpoints directly from the Baseten UI by pressing the dropdown menu on your completed training job and selecting "Deploy" on your selected checkpoint. ### Advanced CLI Deployment You can also: * run `truss train deploy_checkpoints [--job-id ]` and follow the setup wizard. * define an instance of a `DeployCheckpointsConfig` class (this is helpful for small changes that aren't provided by the wizard) and run `truss train deploy_checkpoints --config `. Currently, the `deploy_checkpoints` command only supports LoRA and Full Fine Tune for Single Node LLM Training jobs. When `deploy_checkpoints` is run, `truss` will construct a deployment `config.yml` and store it on disk in a temporary directory. 
If you'd like to preserve or modify the resulting deployment config, you can copy paste it into a permanent directory and customize it as needed. This file defines the source of truth for the deployment and can be deployed independently via `truss push`. See [deployments](../deployment/deployments) for more details. After successful deployment, your model will be deployed on Baseten, where you can run inference requests and evaluate performance. See [Calling Your Model](/inference/calling-your-model) for more details. To download the files you saved to the checkpointing directory or understand the file structure, you can run `truss train get_checkpoint_urls [--job-id=]` to get a JSON file containing presigned URLs for each training job. The JSON file contains the following structure: ```json theme={"system"} { "timestamp": "2025-06-23T13:44:16.485905+00:00", "job": { "id": "03yv1l3", "created_at": "2025-06-18T14:30:30.480Z", "current_status": "TRAINING_JOB_COMPLETED", "error_message": null, "instance_type": { "id": "H100:2x8x176x968", "name": "H100:2x8x176x968 - 2 Nodes of 8 H100 GPUs, 640 GiB VRAM, 176 vCPUs, 968 GiB RAM", "memory_limit_mib": 967512, "millicpu_limit": 176000, "gpu_count": 8, "gpu_type": "H100", "gpu_memory_limit_mib": 655360 }, "updated_at": "2025-06-18T14:30:30.510Z", "training_project_id": "lqz9o34", "training_project": { "id": "lqz9o34", "name": "checkpointing" } }, "checkpoint_artifacts": [ { "url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-0/checkpoint-24/tokenizer_config.json?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056", "relative_file_name": "checkpoint-24/tokenizer_config.json", "node_rank": 0 } ... ] } ``` **Important notes about the presigned URLs:** * The presigned URLs expire after **7 days** from generation * These URLs are primarily intended for **evaluation and testing purposes**, not for long-term inference deployments * For production deployments, consider copying the checkpoint files to your Truss model directory and downloading them in the model's `load()` function ## Complex and Custom Use Cases * Custom Model Architectures * Weights Sharded Across Nodes (Contact Baseten for help implementing this) Examine the structure of your files with `truss train get_checkpoint_urls --job-id=`. If a file looks like this: ```json theme={"system"} { "url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-4/checkpoint-10/weights.safetensors?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056", "relative_file_name": "checkpoint-10/weights.safetensors", "node_rank": 4 } ``` In your Truss configuration, add a section like this: Wildcards `*` match to an arbitrary number of chars while `?` matches to one. ```yaml theme={"system"} training_checkpoints: download_folder: /tmp/training_checkpoints artifact_references: - training_job_id: paths: - rank-*/checkpoint-10/ # Pull in all the files for checkpoint-10 across all nodes ``` When your model pod starts up, you can read the file from the path `/tmp/training_checkpoints/rank-[node-rank]/[relative_file_name]`. 
For the example above, the file can be read from: ``` /tmp/training_checkpoints/rank-4/checkpoint-10/weights.safetensors ``` --- # Source: https://docs.baseten.co/troubleshooting/deployments.md # Source: https://docs.baseten.co/deployment/deployments.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deployments > Deploy, manage, and scale machine learning models with Baseten A **deployment** in Baseten is a **containerized instance of a model** that serves inference requests via an API endpoint. Deployments exist independently but can be **promoted to an environment** for structured access and scaling. Every deployment is **automatically wrapped in a REST API**. Once deployed, models can be queried with a simple HTTP request: ```python theme={"system"} import requests resp = requests.post( "https://model-{modelID}.api.baseten.co/deployment/{deploymentID}/predict", headers={"Authorization": "Api-Key YOUR_API_KEY"}, json={'text': 'Hello my name is {MASK}'}, ) print(resp.json()) ``` [Learn more about running inference on your deployment](/inference/calling-your-model) *** # Development deployment A **development deployment** is a mutable instance designed for rapid iteration. It is always in the **development state** and cannot be renamed or detached from it. Key characteristics: * **Live reload** enables direct updates without redeployment. * **Single replica, scales to zero** when idle to conserve compute resources. * **No autoscaling or zero-downtime updates.** * **Can be promoted** to create a persistent deployment. Once promoted, the development deployment transitions to a **deployment** and can optionally be promoted to an environment. *** # Environments and promotion Environments provide **logical isolation** for managing deployments but are **not required** for a deployment to function. A deployment can be executed independently or promoted to an environment for controlled traffic allocation and scaling. * The **production environment** exists by default. * **Custom environments** (e.g., staging) can be created for specific workflows. * **Promoting a deployment does not modify its behavior**, only its routing and lifecycle management. ## Canary deployments Canary deployments support **incremental traffic shifting** to a new deployment, mitigating risk during rollouts. * Traffic is routed in **10 evenly distributed stages** over a configurable time window. * Traffic only begins to shift once the new deployment reaches the min replica count of the current production model. * Autoscaling dynamically adjusts to real-time demand. * Canary rollouts can be enabled or canceled via the UI or [REST API](/reference/management-api/environments/update-an-environments-settings). *** # Managing Deployments ## Naming deployments By default, deployments of a model are named `deployment-1`, `deployment-2`, and so forth sequentially. You can instead give deployments custom names via two methods: 1. While creating the deployment, using a [command line argument in truss push](/reference/sdk/truss#deploying-a-model). 2. After creating the deployment, in the model management page within your Baseten dashboard. Renaming deployments is purely aesthetic and does not affect model management API paths, which work via model and deployment IDs. ## Deactivating a deployment A deployment can be deactivated to suspend inference execution while preserving configuration.
* **Remains visible in the dashboard.** * **Consumes no compute resources** but can be reactivated anytime. * **API requests return a 404 error while deactivated.** For demand-driven deployments, consider [autoscaling with scale to zero](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings). ## Deleting deployments Deployments can be **permanently deleted**, but production deployments must be replaced before deletion. * **Deleted deployments are purged from the dashboard** but retained in usage logs. * **All associated compute resources are released.** * **API requests return a 404 error post-deletion.** Deletion is irreversible — use deactivation if retention is required. --- # Source: https://docs.baseten.co/development/model-apis/deprecation.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Deprecation > Baseten's deprecation policy for Model APIs As open source models advance rapidly, Baseten prioritizes serving the highest quality models and deprecates specific Model APIs when stronger alternatives are available. When a model is selected for deprecation, Baseten follows this process: 1. **Announcement** * Deprecations are announced approximately two weeks before the deprecation date. * Documentation is updated to identify the model being deprecated and recommend a replacement. * Affected users are contacted via email. 2. **Transition** * The deprecated model remains fully functional until the deprecation date. You have approximately two weeks to transition using one of these options: 1. Migrate to a dedicated deployment with the deprecated model weights. [Contact us](https://www.baseten.co/talk-to-us/deprecation-inquiry/) for assistance. 2. Update your code to use an active model (a recommendation is provided in the deprecation announcement). 3. **Deprecation date** * The model ID for the deprecated model becomes inactive and returns an error for all requests. * A changelog notification is published with the recommended replacement. ## Planned deprecations | Deprecation Date | Model | Recommended Replacement | Dedicated Available | | :--------------- | :----------------------------- | :------------------------------------------------- | :-----------------: | | 2026-02-06 | Qwen3 Coder 480B A35B Instruct | [GLM 4.7](https://www.baseten.co/library/glm-4-7/) | ✅ | --- # Source: https://docs.baseten.co/development/chain/design.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Architecture and design > How to structure your Chainlets A Chain is composed of multiple connected Chainlets working together to perform a task. For example, the Chain in the diagram below takes a large audio file as input. Then it splits it into smaller chunks, transcribes each chunk in parallel (reducing the end-to-end latency), and finally aggregates and returns the results. To build an efficient Chain, we recommend drafting your high-level structure as a flowchart or diagram. This can help you identify parallelizable units of work and steps that need different (model/hardware) resources. If one Chainlet creates many "sub-tasks" by calling other dependency Chainlets (e.g. in a loop over partial work items), these calls should be done as `asyncio` tasks that run concurrently (see the sketch below).
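A minimal sketch of this fan-out pattern in plain `asyncio` (here `process_chunk` is an illustrative stand-in for awaiting a dependency Chainlet's `run_remote`):

```python theme={"system"}
import asyncio


async def process_chunk(chunk: str) -> str:
    # Stand-in for e.g. `await self._transcriber.run_remote(chunk)`.
    await asyncio.sleep(0.1)
    return chunk.upper()


async def fan_out(chunks: list[str]) -> list[str]:
    tasks = []
    for chunk in chunks:
        tasks.append(asyncio.ensure_future(process_chunk(chunk)))
        # Yield to the event loop so the task starts now rather than
        # only when the results are awaited below.
        await asyncio.sleep(0)
    # Collect results in submission order once all sub-tasks finish.
    return await asyncio.gather(*tasks)


# Example: asyncio.run(fan_out(["part one", "part two", "part three"]))
```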
That way you get the most out of the parallelism that Chains offers. This design pattern is extensively used in the [audio transcription example](/examples/chains-audio-transcription). While using `asyncio` is essential for performance, it can also be tricky. Here are a few caveats to look out for: * Executing operations in an async function that block the event loop for more than a fraction of a second. This hinders the "flow" of processing requests concurrently and starting RPCs to other Chainlets. Ideally use native async APIs. Frameworks like vLLM or Triton server offer such APIs; similarly, file downloads can be made async, and you might find [`AsyncBatcher`](https://github.com/hussein-awala/async-batcher) useful. If there is no async support, consider running blocking code in a thread/process pool (as an attribute of a Chainlet). * Creating async tasks (e.g. with `asyncio.ensure_future`) does not start the task *immediately*. In particular, when starting several tasks in a loop, `ensure_future` must be alternated with operations that yield to the event loop, so that each task can actually be started. If the loop is not an `async for` loop and does not contain other `await` statements, a "dummy" await can be added, for example `await asyncio.sleep(0)`. This allows the tasks to be started concurrently. --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-async-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async development > Use this endpoint to call the [development deployment](/deploy/lifecycle) of your model asynchronously. ### Parameters The ID of the model you want to call. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body There is a 256 KiB size limit to `/async_predict` request payloads. JSON-serializable model input. Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later. URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported. Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1). `priority` is between 0 and 2, inclusive. Maximum time a request will spend in the queue before expiring. `max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive. Exponential backoff parameters used to retry the model predict request. Number of predict request attempts. `max_attempts` must be between 1 and 10, inclusive. Minimum time between retries in milliseconds. `initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive. Maximum time between retries in milliseconds. `max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive. ### Response The ID of the async request. ### Rate limits Two types of rate limits apply when making async requests: * Calls to the `/async_predict` endpoint are limited to **200 requests per second**. * Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments. If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code.
To avoid hitting these rate limits, we advise: * Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors. * Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings. ```python Python theme={"system"} import requests import os model_id = "" webhook_endpoint = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.post( f"https://model-{model_id}.api.baseten.co/development/async_predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": webhook_endpoint # Optional fields for priority, max_time_in_queue_seconds, etc }, ) print(resp.json()) ``` ```sh cURL theme={"system"} curl --request POST \ --url https://model-{model_id}.api.baseten.co/development/async_predict \ --header "Authorization: Api-Key $BASETEN_API_KEY" \ --data '{ "model_input": {"prompt": "hello world!"}, "webhook_endpoint": "https://my_webhook.com/webhook", "priority": 1, "max_time_in_queue_seconds": 100, "inference_retry_config": { "max_attempts": 3, "initial_delay_ms": 1000, "max_delay_ms": 5000 } }' ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/development/async_predict", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({ model_input: { prompt: "hello world!" }, webhook_endpoint: "https://my_webhook.com/webhook", priority: 1, max_time_in_queue_seconds: 100, inference_retry_config: { max_attempts: 3, initial_delay_ms: 1000, max_delay_ms: 5000, }, }), } ); const data = await resp.json(); console.log(data); ``` ```json 201 theme={"system"} { "request_id": "" } ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-async-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async chains development Use this endpoint to call the [development deployment](/development/chain/deploy#development) of your chain asynchronously. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/development/async_run_remote ``` ### Parameters The ID of the chain you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e. The top-level keys are the argument names. The values are the corresponding JSON representation of the types. 
```python Python theme={"system"} import urllib3 import os chain_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain-{chain_id}.api.baseten.co/development/async_run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/development/async_run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require('node-fetch'); const resp = await fetch( 'https://chain-{chain_id}.api.baseten.co/development/async_run_remote', { method: 'POST', headers: { Authorization: 'Api-Key YOUR_API_KEY' }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json 201 theme={"system"} { "request_id": "" } ``` --- # Source: https://docs.baseten.co/reference/inference-api/status-endpoints/development-get-async-queue-status.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async development > Use this endpoint to get the status of a development deployment's async queue. ### Parameters The ID of the model. The ID of the chain. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Response The ID of the model. The ID of the deployment. The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model). The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model). ```json 200 theme={"system"} { "model_id": "", "deployment_id": "", "num_queued_requests": 12, "num_in_progress_requests": 3 } ``` ### Rate limits Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code. To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors. ```py Model theme={"system"} import requests import os model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://model-{model_id}.api.baseten.co/development/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` ```py Chain theme={"system"} import requests import os chain_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://chain-{chain_id}.api.baseten.co/development/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Development Use this endpoint to call the [development deployment](/deployment/deployments) of your model. 
```sh theme={"system"} https://model-{model_id}.api.baseten.co/development/predict ``` ### Parameters The ID of the model you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable model input. ```python Python theme={"system"} import urllib3 import os model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://model-{model_id}.api.baseten.co/development/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable model input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://model-{model_id}.api.baseten.co/development/predict \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable model input ``` ```sh Truss theme={"system"} truss predict --model-version DEPLOYMENT_ID -d '{}' # JSON-serializable model input ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/development/predict", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({}), // JSON-serializable model input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response // JSON-serializable output varies by model theme={"system"} {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Chains development Use this endpoint to call the [development deployment](/development/chain/deploy#development) of your chain. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/development/run_remote ``` ### Parameters The ID of the chain you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e. The top-level keys are the argument names. The values are the corresponding JSON representation of the types. 
```python Python theme={"system"} import urllib3 import os chain_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain-{chain_id}.api.baseten.co/development/run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/development/run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require('node-fetch'); const resp = await fetch( 'https://chain-{chain_id}.api.baseten.co/development/run_remote', { method: 'POST', headers: { Authorization: 'Api-Key YOUR_API_KEY' }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response theme={"system"} // JSON-serializable output varies by chain {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/wake/development-wake.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Development Use this endpoint to wake the [development deployment](/deployment/deployments#development-deployment) of your model if it is scaled to zero. ```sh theme={"system"} https://model-{model_id}.api.baseten.co/development/wake ``` ### Parameters The ID of the model you want to wake. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ```python Python theme={"system"} import urllib3 import os model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://model-{model_id}.api.baseten.co/development/wake", headers={"Authorization": f"Api-Key {baseten_api_key}"}, ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://model-{model_id}.api.baseten.co/development/wake \ -H 'Authorization: Api-Key YOUR_API_KEY' \ ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/development/wake", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, } ); const data = await resp.json(); console.log(data); ``` ```json Example Response // Returns a 202 response code theme={"system"} {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-websocket.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Websocket development Use this endpoint to connect via WebSockets to the development deployment of a model or chain. ```sh theme={"system"} wss://{entity}-{entity_id}.api.baseten.co/development/websocket ``` See [WebSockets](/development/model/websockets) for more details. ### Parameters The type of entity you want to connect to. Either `model` or `chain`. The ID of the model or chain you want to connect to. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```sh websocat theme={"system"} websocat -H 'Authorization: Api-Key YOUR_API_KEY' \ wss://{entity}-{entity_id}.api.baseten.co/development/websocket ``` --- # Source: https://docs.baseten.co/examples/docker.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Dockerized model > Deploy any model in a pre-built Docker container In this example, we deploy a dockerized model for [infinity embedding server](https://github.com/michaelfeil/infinity), a high-throughput, low-latency REST API server for serving vector embeddings. # Setting up the `config.yaml` To deploy a dockerized model, all you need is a `config.yaml`. It specifies how to build your Docker image, start the server, and manage resources. Let’s break down each section. ## Base Image Sets the foundational Docker image to a lightweight Python 3.11 environment. ```yaml config.yaml theme={"system"} base_image: image: python:3.11-slim ``` ## Docker Server Configuration Configures the server's startup command, health check endpoints, prediction endpoint, and the port on which the server will run. ```yaml config.yaml theme={"system"} docker_server: start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) infinity_emb v2 --batch-size 64 --model-id BAAI/bge-small-en-v1.5 --revision main" readiness_endpoint: /health liveness_endpoint: /health predict_endpoint: /embeddings server_port: 7997 ``` ## Build Commands (Optional) Pre-downloads model weights during the build phase to ensure the model is ready at container startup. ```yaml config.yaml theme={"system"} build_commands: # optional step to download the weights of the model into the image - sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) infinity_emb v2 --preload-only --no-model-warmup --model-id BAAI/bge-small-en-v1.5 --revision main" ``` ## Configure resources Note that we need an L4 to run this model. ```yaml config.yaml theme={"system"} resources: accelerator: L4 use_gpu: true ``` ## Requirements Lists the Python package dependencies required for the infinity embedding server. ```yaml config.yaml theme={"system"} requirements: - infinity-emb[all]==0.0.72 ``` ## Runtime Settings Sets the server to handle up to 40 concurrent inferences to manage load efficiently. ```yaml config.yaml theme={"system"} runtime: predict_concurrency: 40 ``` ## Environment Variables Defines essential environment variables including the Hugging Face access token, request batch size, queue size limit, and a flag to disable tracking. ```yaml config.yaml theme={"system"} environment_variables: hf_access_token: null # constrain api to at most 256 sentences per request, for better load-balancing INFINITY_MAX_CLIENT_BATCH_SIZE: 256 # constrain model to a max backpressure of INFINITY_MAX_CLIENT_BATCH_SIZE * predict_concurrency = 10241 requests INFINITY_QUEUE_SIZE: 10241 DO_NOT_TRACK: 1 ``` # Deploy dockerized model Deploy the model like you would other Trusses, with: ```bash theme={"system"} truss push infinity-embedding-server --publish ``` --- # Source: https://docs.baseten.co/reference/training-api/download-training-job.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
# Download training job source code > Get the uploaded training job as a S3 Artifact ## OpenAPI ````yaml get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/download openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}/download: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' get: summary: Get the uploaded training job as a S3 Artifact description: Get the uploaded training job as a S3 Artifact responses: '200': description: A response that includes the artifacts for a training job content: application/json: schema: $ref: '#/components/schemas/DownloadTrainingJobResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: DownloadTrainingJobResponseV1: description: A response that includes the artifacts for a training job properties: artifact_presigned_urls: description: Presigned URL's for the artifacts items: type: string title: Artifact Presigned Urls type: array required: - artifact_presigned_urls title: DownloadTrainingJobResponseV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/engines/engine-builder-llm/engine-builder-config.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Reference config (Engine-Builder-LLM) > Complete reference config for dense text generation models This reference covers all build and runtime options for Engine-Builder-LLM deployments. All settings use the `trt_llm` section in `config.yaml`. ## Configuration structure ```yaml theme={"system"} trt_llm: inference_stack: v1 # Always v1 for Engine-Builder-LLM build: base_model: decoder checkpoint_repository: {...} max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 8192 quantization_type: no_quant | fp8 | fp8_kv | fp4 | fp4_kv | fp4_mlp_only quantization_config: {...} tensor_parallel_count: 1 plugin_configuration: {...} speculator: {...} # Optional for lookahead decoding runtime: kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict served_model_name: "model-name" total_token_limit: 500000 ``` ## Build configuration The `build` section configures model compilation and optimization settings. The base model architecture for your model checkpoint. **Options:** * `decoder`: For CausalLM models (Llama, Mistral, Qwen, Gemma, Phi) ```yaml theme={"system"} build: base_model: decoder ``` Specifies where to find the model checkpoint. Repository must be a valid Hugging Face model repository with the standard structure (config.json, tokenizer files, model weights). 
**Source options:** * `HF`: Hugging Face Hub (default) * `GCS`: Google Cloud Storage * `S3`: AWS S3 * `AZURE`: Azure Blob Storage * `REMOTE_URL`: HTTP URL to tar.gz file * `BASETEN_TRAINING`: Baseten Training checkpoints For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3). ```yaml theme={"system"} checkpoint_repository: source: HF repo: "meta-llama/Llama-3.3-70B-Instruct" revision: main runtime_secret_name: hf_access_token ``` Maximum sequence length (context) for single requests. Range: 1 to 1048576. ```yaml theme={"system"} build: max_seq_len: 131072 # 128K context ``` Maximum number of input sequences processed concurrently. Range: 1 to 2048. Unless lookahead decoding is enabled, this parameter has little effect on performance. Keep it at 256 for most cases. Recommended not to be set below 8 to keep performance dynamic for various problems. ```yaml theme={"system"} build: max_batch_size: 256 ``` Maximum number of batched input tokens after padding removal in each batch. Range: 256 to 131072, must be multiple of 64. If `enable_chunked_prefill: false`, this also limits the `max_seq_len` that can be processed. Recommended: `8192` or `16384`. ```yaml theme={"system"} build: max_num_tokens: 16384 ``` Specifies the quantization format for model weights. **Options:** * `no_quant`: `FP16`/`BF16` precision * `fp8`: `FP8` weights + 16-bit KV cache * `fp8_kv`: `FP8` weights + `FP8` KV cache * `fp4`: `FP4` weights + 16-bit KV cache (B200 only) * `fp4_kv`: `FP4` weights + `FP8` KV cache (B200 only) * `fp4_mlp_only`: `FP4` MLP only + 16-bit KV (B200 only) For detailed quantization guidance, see [Quantization Guide](/engines/performance-concepts/quantization-guide). ```yaml theme={"system"} build: quantization_type: fp8_kv ``` Configuration for post-training quantization calibration. **Fields:** * `calib_size`: Size of calibration dataset (64-16384, multiple of 64). Defines how many rows of the train split with text column to take. * `calib_dataset`: HuggingFace dataset for calibration. Dataset must have 'text' column (str type) for samples, or 'train' split as subsection. * `calib_max_seq_length`: Maximum sequence length for calibration. ```yaml theme={"system"} build: quantization_type: fp8 quantization_config: calib_size: 1536 calib_dataset: "cnn_dailymail" calib_max_seq_length: 1024 ``` Number of GPUs to use for tensor parallelism. Range: 1 to 8. ```yaml theme={"system"} build: tensor_parallel_count: 4 # For 70B+ models ``` TensorRT-LLM plugin configuration for performance optimization. **Fields:** * `paged_kv_cache`: Enable paged KV cache (recommended: true) * `use_paged_context_fmha`: Enable paged context FMHA (recommended: true) * `use_fp8_context_fmha`: Enable `FP8` context FMHA (requires `FP8_KV` quantization) ```yaml theme={"system"} build: plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: true # For FP8_KV quantization ``` Configuration for speculative decoding with lookahead. For detailed configuration, see [Lookahead decoding](/engines/engine-builder-llm/lookahead-decoding). 
**Fields:** * `speculative_decoding_mode`: `LOOKAHEAD_DECODING` (recommended) * `lookahead_windows_size`: Window size for speculation (1-8) * `lookahead_ngram_size`: N-gram size for patterns (1-16) * `lookahead_verification_set_size`: Verification buffer size (1-8) * `enable_b10_lookahead`: Enable Baseten's lookahead algorithm ```yaml theme={"system"} build: speculator: speculative_decoding_mode: LOOKAHEAD_DECODING lookahead_windows_size: 3 lookahead_ngram_size: 8 lookahead_verification_set_size: 3 enable_b10_lookahead: true ``` Number of GPUs to use during the build job. Only set this if you encounter errors during the build job. It has no impact once the model reaches the deploying stage. If not set, equals `tensor_parallel_count`. ```yaml theme={"system"} build: num_builder_gpus: 2 ``` ## Runtime configuration The `runtime` section configures inference engine behavior. Fraction of GPU memory to reserve for KV cache. Range: 0.1 to 1.0. ```yaml theme={"system"} runtime: kv_cache_free_gpu_mem_fraction: 0.85 ``` Enable chunked prefilling for long sequences. ```yaml theme={"system"} runtime: enable_chunked_context: true ``` Policy for scheduling requests in batches. **Options:** * `max_utilization`: Maximize GPU utilization (may evict requests) * `guaranteed_no_evict`: Guarantee request completion (recommended) ```yaml theme={"system"} runtime: batch_scheduler_policy: guaranteed_no_evict ``` Model name returned in API responses. ```yaml theme={"system"} runtime: served_model_name: "Llama-3.3-70B-Instruct" ``` Maximum number of tokens that can be scheduled at once. Range: 1 to 1000000. ```yaml theme={"system"} runtime: total_token_limit: 1000000 ``` ## Configuration examples ### Llama 3.3 70B ```yaml theme={"system"} model_name: Llama-3.3-70B-Instruct resources: accelerator: H100:4 cpu: '4' memory: 40Gi use_gpu: true trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: "meta-llama/Llama-3.3-70B-Instruct" revision: main runtime_secret_name: hf_access_token max_seq_len: 131072 max_batch_size: 256 max_num_tokens: 8192 quantization_type: fp8_kv tensor_parallel_count: 4 plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: true quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 runtime: kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict served_model_name: "Llama-3.3-70B-Instruct" ``` ### Qwen 2.5 32B with lookahead decoding ```yaml theme={"system"} model_name: Qwen-2.5-32B-Lookahead resources: accelerator: H100:2 cpu: '2' memory: 20Gi use_gpu: true trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: "Qwen/Qwen2.5-32B-Instruct" revision: main max_seq_len: 32768 max_batch_size: 128 max_num_tokens: 8192 quantization_type: fp8_kv tensor_parallel_count: 2 speculator: speculative_decoding_mode: LOOKAHEAD_DECODING lookahead_windows_size: 3 lookahead_ngram_size: 8 lookahead_verification_set_size: 3 enable_b10_lookahead: true plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: true runtime: kv_cache_free_gpu_mem_fraction: 0.85 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict served_model_name: "Qwen-2.5-32B-Instruct" ``` ### Small model on L4 ```yaml theme={"system"} model_name: Llama-3.2-3B-Instruct resources: accelerator: L4 cpu: '1' memory: 10Gi use_gpu: true trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: 
"meta-llama/Llama-3.2-3B-Instruct" revision: main max_seq_len: 8192 max_batch_size: 256 max_num_tokens: 4096 quantization_type: fp8 tensor_parallel_count: 1 plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: false runtime: kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict served_model_name: "Llama-3.2-3B-Instruct" ``` ### B200 with `FP4` quantization ```yaml theme={"system"} model_name: Qwen-2.5-32B-FP4 resources: accelerator: B200 cpu: '2' memory: 20Gi use_gpu: true trt_llm: build: base_model: decoder checkpoint_repository: source: HF repo: "Qwen/Qwen2.5-32B-Instruct" revision: main max_seq_len: 32768 max_batch_size: 256 max_num_tokens: 8192 quantization_type: fp4_kv tensor_parallel_count: 1 plugin_configuration: paged_kv_cache: true use_paged_context_fmha: true use_fp8_context_fmha: true quantization_config: calib_size: 1024 calib_dataset: "cnn_dailymail" calib_max_seq_length: 2048 runtime: kv_cache_free_gpu_mem_fraction: 0.9 enable_chunked_context: true batch_scheduler_policy: guaranteed_no_evict served_model_name: "Qwen-2.5-32B-Instruct" ``` ## Validation and troubleshooting ### Common errors **Error:** `FP8 quantization is only supported on L4, H100, H200, B200` * **Cause:** Using `FP8` quantization on unsupported GPU. * **Fix:** Use H100 or newer GPU, or use `no_quant`. **Error:** `FP4 quantization is only supported on B200` * **Cause:** Using `FP4` quantization on unsupported GPU. * **Fix:** Use B200 GPU or `FP8` quantization. **Error:** `Using fp8 context fmha requires fp8 kv, or fp4 with kv cache dtype` * **Cause:** Mismatch between quantization and context FMHA settings. * **Fix:** Use `fp8_kv` quantization or disable `use_fp8_context_fmha`. **Error:** `Tensor parallelism and GPU count must be the same` * **Cause:** Mismatch between `tensor_parallel_count` and GPU count. * **Fix:** Ensure `tensor_parallel_count` matches `accelerator` count. ### Performance tuning **For lowest latency:** * Reduce `max_batch_size` and `max_num_tokens`. * Use `batch_scheduler_policy: guaranteed_no_evict`. * Consider smaller models or quantization. **For highest throughput:** * Increase `max_batch_size` and `max_num_tokens`. * Use `batch_scheduler_policy: max_utilization`. * Enable quantization on supported hardware. **For cost optimization:** * Use L4 GPUs with `FP8` quantization. * Choose appropriately sized models. * Tune `max_seq_len` to your actual requirements. ## Model repository structure All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, similar to running: ```bash theme={"system"} git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct ``` ### Required files **Model configuration (`config.json`):** * `max_position_embeddings`: Limits maximum context size (content beyond this is truncated). * `vocab_size`: Vocabulary size for the model. * `architectures`: Must include `LlamaForCausalLM`, `MistralForCausalLM`, or similar causal LM architectures. Custom code is typically not read. * `torch_dtype`: Default inference dtype (`float16` or `bfloat16`). Cannot be a pre-quantized model. **Model weights (`model.safetensors`):** * Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded). * Convert to safetensors if you encounter issues with other formats. * Cannot be a pre-quantized model. Model must be an `fp16`, `bf16`, or `fp32` checkpoint. 
**Tokenizer files (`tokenizer_config.json` and `tokenizer.json`):** * For maximum compatibility, use "FAST" tokenizers compatible with Rust. * Cannot contain custom Python code. * For chat completions: must contain `chat_template`, a Jinja2 template. ### Architecture support | **Model family** | **Supported architectures** | **Notes** | | ---------------- | -------------------------------------- | --------------------------------------------------- | | **Llama** | `LlamaForCausalLM` | Full support for Llama 3. For Llama 4, use BIS-LLM. | | **Mistral** | `MistralForCausalLM` | Including v0.3 and Small variants. | | **Qwen** | `Qwen2ForCausalLM`, `Qwen3ForCausalLM` | Including Qwen 2.5 and Qwen 3 series. | | **QwenMoE** | `Qwen3MoEForCausalLM` | Specific support for Qwen3MoE. | | **Gemma** | `GemmaForCausalLM` | Including Gemma 2 and Gemma 3 series, bf16 only. | ## Best practices ### Model size and GPU selection | **Model size** | **Recommended GPU** | **Quantization** | **Tensor parallel** | | -------------- | ------------------- | ---------------- | ------------------- | | `<8B` | L4/H100 | `FP8_KV` | 1 | | 8B-70B | H100 | `FP8_KV` | 1-2 | | 70B+ | H100/B200 | `FP8_KV`/`FP4` | 4+ | ### Production recommendations * Use `quantization_type: fp8_kv` for best performance/accuracy balance. * Set `max_batch_size` based on your expected traffic patterns. * Enable `paged_kv_cache` and `use_paged_context_fmha` for optimal performance. ### Development recommendations * Use `quantization_type: no_quant` for fastest iteration. * Set smaller `max_seq_len` to reduce build time. * Use `batch_scheduler_policy: guaranteed_no_evict` for predictable behavior. --- # Source: https://docs.baseten.co/development/model/performance/engine-builder-customization.md # Engine control in Python > Use `model.py` to customize engine behavior When you create a new Truss with `truss init`, it creates two files: `config.yaml` and `model/model.py`. While you configure the Engine Builder in `config.yaml`, you may use `model/model.py` to access and control the engine object during inference. You have two options: 1. Delete the `model/model.py` file and your TensorRT-LLM engine will run according to its base spec. 2. Update the code to support TensorRT-LLM. You must either update `model/model.py` to pass `trt_llm` as an argument to the `__init__` method OR delete the file. Otherwise, you will get an error on deployment, as the default `model/model.py` file is not written for TensorRT-LLM. The `engine` object is a property of the `trt_llm` argument and must be initialized in `__init__` to be accessed in `load()` (which runs once on server start-up) and `predict()` (which runs for each request handled by the server).
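As a minimal sketch of that wiring (shown before the fuller example below, and assuming only the `trt_llm` argument and its `engine` property described above):

```python theme={"system"}
from typing import Any


class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        # Capture the engine object in __init__ so it is available in load() and predict().
        self._engine = trt_llm["engine"]

    def load(self) -> None:
        # Runs once on server start-up; load tokenizers or other assets here.
        pass

    async def predict(self, model_input: Any) -> Any:
        # Runs for each request; delegate to the underlying TensorRT-LLM engine.
        return await self._engine.predict(model_input)
```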
This example applies a chat template with the Llama 3.1 8B tokenizer to the model prompt: ```python model/model.py theme={"system"} import orjson # faster serialization/deserialization than built-in json from typing import Any, AsyncIterator from transformers import AutoTokenizer from fastapi.responses import StreamingResponse SSE_PREFIX = "data: " class Model: def __init__(self, trt_llm, **kwargs) -> None: self._secrets = kwargs["secrets"] self._engine = trt_llm["engine"] self._model = None self._tokenizer = None def load(self) -> None: self._tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", token=self._secrets["hf_access_token"]) async def predict(self, model_input: Any) -> Any: # Apply chat template to prompt model_input["prompt"] = self._tokenizer.apply_chat_template(model_input["prompt"], tokenize=False) response = await self._engine.predict(model_input) # If the response is streaming, post-process each chunk if isinstance(response, StreamingResponse): token_gen = response.body_iterator async def processed_stream(): async for chunk in some_post_processing_function(token_gen): yield chunk return StreamingResponse(processed_stream(), media_type="text/event-stream") # Otherwise, return the raw output else: return response # --- Post-processing helpers for SSE --- def parse_sse_chunk(chunk: bytes) -> dict | None: """Parses an SSE-formatted chunk and returns the JSON payload.""" try: text = chunk.decode("utf-8").strip() if not text.startswith(SSE_PREFIX): return None return orjson.loads(text[len(SSE_PREFIX):]) except Exception: return None def format_sse_chunk(payload: dict) -> bytes: """Formats a JSON payload back into an SSE chunk.""" return SSE_PREFIX.encode("utf-8") + orjson.dumps(payload) + b"\n\n" def transform_payload(payload: dict) -> dict: """Add a new field to the SSE payload.""" payload["my_new_field"] = "my_new_value" return payload async def some_post_processing_function( token_gen: AsyncIterator[bytes] ) -> AsyncIterator[bytes]: """Post-process each SSE chunk in the stream.""" async for chunk in token_gen: payload = parse_sse_chunk(chunk) if payload is None: yield chunk continue transformed = transform_payload(payload) yield format_sse_chunk(transformed) ``` --- # Source: https://docs.baseten.co/development/chain/engine-builder-models.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Engine-Builder LLM Models > Engine-Builder LLM models are pre-trained models that are optimized for specific inference tasks. Baseten's [Engine-Builder](/engines/engine-builder-llm/overview) enables the deployment of optimized model inference engines. Currently, it supports TensorRT-LLM. Truss Chains allows seamless integration of these engines into structured workflows. This guide provides a quick entry point for Chains users. ## Llama 7B Example Use the `EngineBuilderLLMChainlet` base class to configure an LLM engine. The additional `engine_builder_config` field specifies the model architecture, checkpoint repository, engine parameters, and more; the full options are detailed in the [Engine-Builder configuration guide](/engines/engine-builder-llm/engine-builder-config).
```python theme={"system"} import truss_chains as chains from truss.base import trt_llm_config, truss_config class Llama7BChainlet(chains.EngineBuilderLLMChainlet): remote_config = chains.RemoteConfig( compute=chains.Compute(gpu=truss_config.Accelerator.H100), assets=chains.Assets(secret_keys=["hf_access_token"]), ) engine_builder_config = truss_config.TRTLLMConfiguration( build=trt_llm_config.TrussTRTLLMBuildConfiguration( base_model=trt_llm_config.TrussTRTLLMModel.LLAMA, checkpoint_repository=trt_llm_config.CheckpointRepository( source=trt_llm_config.CheckpointSource.HF, repo="meta-llama/Llama-3.1-8B-Instruct", ), max_batch_size=8, max_seq_len=4096, tensor_parallel_count=1, ) ) ``` ## Differences from Standard Chainlets * No `run_remote` implementation: Unlike regular Chainlets, `EngineBuilderLLMChainlet` does not require users to implement `run_remote()`. Instead, it automatically wires into the deployed engine’s API. All LLM Chainlets have the same function signature: `chains.EngineBuilderLLMInput` as input and a stream (`AsyncIterator`) of strings as output. Likewise, `EngineBuilderLLMChainlet`s can only be used as dependencies; they cannot have dependencies themselves. * No `run_local` ([guide](/development/chain/localdev)) or `watch` ([guide](/development/chain/watch)): Standard Chains support a local debugging mode and watch. However, when using `EngineBuilderLLMChainlet`, local execution is not available, and testing must be done after deployment. For a faster dev loop on the rest of your chain (everything except the engine builder Chainlet), you can substitute those Chainlets with stubs, just as you can for an already deployed Truss model \[[guide](/development/chain/stub)]. ## Integrate the Engine-Builder Chainlet After defining an `EngineBuilderLLMChainlet` like `Llama7BChainlet` above, you can use it as a dependency in other conventional chainlets: ```python theme={"system"} from typing import AsyncIterator import truss_chains as chains @chains.mark_entrypoint class TestController(chains.ChainletBase): """Example using the Engine-Builder Chainlet in another Chainlet.""" def __init__(self, llm=chains.depends(Llama7BChainlet)) -> None: self._llm = llm async def run_remote(self, prompt: str) -> AsyncIterator[str]: messages = [{"role": "user", "content": prompt}] llm_input = chains.EngineBuilderLLMInput(messages=messages) async for chunk in self._llm.run_remote(llm_input): yield chunk ``` --- # Source: https://docs.baseten.co/development/model/performance/engine-builder-overview.md # Engine builder overview > Deploy optimized model inference servers in minutes If you have a foundation model like Llama 3 or a fine-tuned variant and want to create a low-latency, high-throughput model inference server, TensorRT-LLM via the Engine Builder is likely the tool for you. TensorRT-LLM is an open source performance optimization toolbox created by NVIDIA. It helps you build TensorRT engines for large language models like Llama and Mistral as well as certain other models like Whisper and large vision models. Baseten's TensorRT-LLM Engine Builder simplifies and automates the process of using TensorRT-LLM for development and production. All you need to do is write a few lines of configuration and an optimized model serving engine will be built automatically during the model deployment process. ## FAQs ### Where are the engines stored? The engines are stored in Baseten but owned by the user — we're working on a mechanism for downloading them.
In the meantime, reach out if you need access to an engine that you created using the Engine Builder. ### Does the Engine Builder support quantization? Yes. The Engine Builder can perform post-training quantization during the building process. For supported options, see [quantization in the config reference](/development/model/performance/engine-builder-config#quantization-type). ### Can I customize the engine behavior? For further control over the TensorRT-LLM engine during inference, use the `model/model.py` file to access the engine object at runtime. See [controlling engines with Python](/development/model/performance/engine-builder-customization) for details. --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-async-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async environment > Use this endpoint to call the model associated with the specified environment asynchronously. ### Parameters The ID of the model you want to call. The name of the model's environment you want to call. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body There is a 256 KiB size limit to `/async_predict` request payloads. JSON-serializable model input. Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later. URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported. Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1). `priority` is between 0 and 2, inclusive. Maximum time a request will spend in the queue before expiring. `max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive. Exponential backoff parameters used to retry the model predict request. Number of predict request attempts. `max_attempts` must be between 1 and 10, inclusive. Minimum time between retries in milliseconds. `initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive. Maximum time between retries in milliseconds. `max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive. ### Response The ID of the async request. ```json 201 theme={"system"} { "request_id": "" } ``` ### Rate limits Two types of rate limits apply when making async requests: * Calls to the `/async_predict` endpoint are limited to **200 requests per second**. * Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments. If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code. To avoid hitting these rate limits, we advise: * Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors. * Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings. 
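As a hedged sketch of calling this endpoint with a simple exponential backoff on 429 responses (it assumes the environments async path follows the `https://model-{model_id}.api.baseten.co/environments/{env_name}/async_predict` pattern used by the other environment endpoints, and that the JSON keys `model_input`, `webhook_endpoint`, and `inference_retry_config` correspond to the body parameters described above; the webhook URL is a placeholder):

```python theme={"system"}
import os
import time

import requests

model_id = ""
env_name = "staging"
baseten_api_key = os.environ["BASETEN_API_KEY"]

payload = {
    "model_input": {"prompt": "hello"},                  # JSON-serializable model input
    "webhook_endpoint": "https://example.com/webhook",   # placeholder HTTPS webhook
    "priority": 0,
    "max_time_in_queue_seconds": 600,
    "inference_retry_config": {                          # assumed key name for the retry settings
        "max_attempts": 3,
        "initial_delay_ms": 1000,
        "max_delay_ms": 5000,
    },
}

# Simple backpressure: back off exponentially if the 429 rate limit is hit.
for attempt in range(5):
    resp = requests.post(
        f"https://model-{model_id}.api.baseten.co/environments/{env_name}/async_predict",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
        json=payload,
    )
    if resp.status_code != 429:
        break
    time.sleep(2**attempt)

print(resp.json())  # e.g. {"request_id": "..."}
```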
--- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-async-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async chains environment > Use this endpoint to call the deployment associated with the specified environment asynchronously. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote ``` ### Parameters The ID of the chain you want to call. The name of the chain's environment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e., the top-level keys are the argument names. The values are the corresponding JSON representation of the types. ```python Python theme={"system"} import urllib3 import os chain_id = "" env_name = "staging" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require('node-fetch'); const resp = await fetch( 'https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote', { method: 'POST', headers: { Authorization: 'Api-Key YOUR_API_KEY' }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json 201 theme={"system"} { "request_id": "" } ``` --- # Source: https://docs.baseten.co/reference/inference-api/status-endpoints/environments-get-async-queue-status.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async environment > Use this endpoint to get the async queue status for a model associated with the specified environment. ### Parameters The ID of the model. The ID of the chain. The name of the environment. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Response The ID of the model. The ID of the deployment. The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model). The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model). ```json 200 theme={"system"} { "model_id": "", "deployment_id": "", "num_queued_requests": 12, "num_in_progress_requests": 3 } ``` ### Rate limits Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code. To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors.
```py Model theme={"system"} import requests import os model_id = "" env_name = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://model-{model_id}.api.baseten.co/environments/{env_name}/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` ```py Chain theme={"system"} import requests import os chain_id = "" env_name = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_queue_status", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-predict.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Environment Use this endpoint to call the deployment associated with the specified [environment](/deployment/environments). ```sh theme={"system"} https://model-{model_id}.api.baseten.co/environments/{env_name}/predict ``` ### Parameters The ID of the model you want to call. The name of the model's environment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable model input. ```python Python theme={"system"} import urllib3 import os model_id = "" env_name = "staging" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://model-{model_id}.api.baseten.co/environments/{env_name}/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable model input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://model-{model_id}.api.baseten.co/environments/{env_name}/predict \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable model input ``` ```javascript Node.js theme={"system"} const fetch = require("node-fetch"); const resp = await fetch( "https://model-{model_id}.api.baseten.co/environments/{env_name}/predict", { method: "POST", headers: { Authorization: "Api-Key YOUR_API_KEY" }, body: JSON.stringify({}), // JSON-serializable model input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response theme={"system"} // JSON-serializable output varies by model {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-run-remote.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Chains environment > Use this endpoint to call the deployment associated with the specified environment. ```sh theme={"system"} https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote ``` ### Parameters The ID of the chain you want to call. The name of the chain's environment you want to call. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Body JSON-serializable chain input. The input schema corresponds to the signature of the entrypoint's `run_remote` method. I.e., the top-level keys are the argument names. The values are the corresponding JSON representation of the types.
```python Python theme={"system"} import urllib3 import os chain_id = "" env_name = "staging" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = urllib3.request( "POST", f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json={}, # JSON-serializable chain input ) print(resp.json()) ``` ```sh cURL theme={"system"} curl -X POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote \ -H 'Authorization: Api-Key YOUR_API_KEY' \ -d '{}' # JSON-serializable chain input ``` ```javascript Node.js theme={"system"} const fetch = require('node-fetch'); const resp = await fetch( 'https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote', { method: 'POST', headers: { Authorization: 'Api-Key YOUR_API_KEY' }, body: JSON.stringify({}), // JSON-serializable chain input } ); const data = await resp.json(); console.log(data); ``` ```json Example Response theme={"system"} // JSON-serializable output varies by chain {} ``` --- # Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-websocket.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Websocket environment Use this endpoint to connect via WebSockets to the deployment associated with the specified [environment](/deployment/environments). Note that `entity` here could be either `model` or `chain`, depending on whether you're using Baseten models or Chains. ```sh theme={"system"} wss://{entity}-{entity_id}.api.baseten.co/environments/{env_name}/websocket ``` See [WebSockets](/development/model/websockets) for more details. ### Parameters The type of entity you want to connect to. Either `model` or `chain`. The ID of the model or chain you want to connect to. The name of the environment you want to connect to. Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ```sh websocat theme={"system"} websocat -H 'Authorization: Api-Key YOUR_API_KEY' \ wss://{entity}-{entity_id}.api.baseten.co/environments/{env_name}/websocket ``` --- # Source: https://docs.baseten.co/development/model/environments.md # Source: https://docs.baseten.co/deployment/environments.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Environments > Manage your model’s release cycles with environments. Environments provide structured management for deployments, ensuring controlled rollouts, stable endpoints, and autoscaling. They help teams stage, test, and release models without affecting production traffic. Deployments can be promoted to an environment (e.g., "staging") to validate outputs before moving to production, allowing for safer model iteration and evaluation.
*** ## Using Environments to manage deployments Environments support **structured validation** before promoting a deployment, including: * **Automated tests and evaluations** * **Manual testing in pre-production** * **Gradual traffic shifts with canary deployments** * **Shadow serving for real-world analysis** Promoting a deployment ensures it inherits **environment-specific scaling and monitoring settings**, such as: * **Dedicated API endpoint** → [Predict API Reference](/reference/inference-api/overview#predict-endpoints) * **Autoscaling controls** → Scale behavior is managed per environment. * **Traffic ramp-up** → Enable [canary rollouts](/deployment/deployments#canary-deployments). * **Monitoring and metrics** → [Export environment metrics](/observability/export-metrics/overview). A **production environment** operates like any other environment but has restrictions: * **It cannot be deleted** unless the entire model is removed. * **You cannot create additional environments named "production."** *** ## Creating custom environments In addition to the standard **production** environment, you can create as many custom environments as needed. There are two ways to create a custom environment: 1. In the model management page on the Baseten dashboard. 2. Via the [create environment endpoint](/reference/management-api/environments/create-an-environment) in the model management API. *** ## Promoting deployments to environments When a deployment is promoted, Baseten follows a **three-step process**: 1. A **new deployment** is created with a unique deployment ID. 2. The deployment **initializes resources** and becomes active. 3. The new deployment **replaces the existing deployment** in that environment. * If there was **no previous deployment**, **default autoscaling settings** are applied. * If a **previous deployment existed**, the new one **inherits autoscaling settings**, and the old deployment is **demoted and scales to zero**. ### Promoting a Published Deployment If a **published deployment** (not a development deployment) is promoted: * Its **autoscaling settings are updated** to match the environment. * If **inactive**, it must be **activated** before promotion. Previous deployments are **demoted but remain in the system**, retaining their **deployment ID and scaling behavior**. *** ## Deploying directly to an environment You can **skip the development stage** and deploy directly to an environment by specifying `--environment` in `truss push`: ```sh theme={"system"} cd my_model/ truss push --environment {environment_name} ``` Only one active promotion per environment is allowed at a time. *** ## Accessing environments in your code The **environment name** is available in `model.py` via the `environment` keyword argument: ```python theme={"system"} def __init__(self, **kwargs): self._environment = kwargs["environment"] ``` To ensure the **environment variable remains updated**, enable **"Re-deploy when promoting"** in the UI or via the [REST API](/reference/management-api/environments/update-an-environments-settings). This guarantees the environment is fully initialized after a promotion. *** ## Deleting environments Environments can be deleted, **except for production**. To remove a **production deployment**, first **promote another deployment to production** or delete the entire model. * **Deleted environments are removed from the overview** but remain in billing history. * **They do not consume resources** after deletion.
* **API requests to a deleted environment return a 404 error.** Deletion is permanent; consider deactivation instead. --- # Source: https://docs.baseten.co/development/chain/errorhandling.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Error Handling > Understanding and handling Chains errors Error handling in Chains follows the principle that the root cause "bubbles up" to the entrypoint, which returns an error response, similar to how Python stack traces contain all the layers from where an exception was raised up to the main function. Consider the case of a Chain where the entrypoint calls `run_remote` of a Chainlet named `TextToNum`, which in turn invokes `TextReplicator`. The respective `run_remote` methods might also use other helper functions that appear in the call stack. Below is an example stack trace that shows how the root cause (a `ValueError`) is propagated up to the entrypoint's `run_remote` method (this is what you would see as an error log): ``` Chainlet-Traceback (most recent call last): File "/packages/itest_chain.py", line 132, in run_remote value = self._accumulate_parts(text_parts.parts) File "/packages/itest_chain.py", line 144, in _accumulate_parts value += self._text_to_num.run_remote(part) ValueError: (showing chained remote errors, root error at the bottom) ├─ Error in dependency Chainlet `TextToNum`: │ Chainlet-Traceback (most recent call last): │ File "/packages/itest_chain.py", line 87, in run_remote │ generated_text = self._replicator.run_remote(data) │ ValueError: (showing chained remote errors, root error at the bottom) │ ├─ Error in dependency Chainlet `TextReplicator`: │ │ Chainlet-Traceback (most recent call last): │ │ File "/packages/itest_chain.py", line 52, in run_remote │ │ validate_data(data) │ │ File "/packages/itest_chain.py", line 36, in validate_data │ │ raise ValueError(f"This input is too long: {len(data)}.") ╰ ╰ ValueError: This input is too long: 100. ``` ## Exception handling and retries The stack trace above is what you see if you don't catch the exception. It is possible to add error handling around each remote Chainlet invocation. Chains tries to raise the same exception class on the *caller* Chainlet as was raised in the *dependency* Chainlet. * Builtin exceptions (e.g. `ValueError`) always work. * Custom or third-party exceptions (e.g. from `torch`) can only be raised in the caller if they are included in the dependencies of the caller as well. If the exception class cannot be resolved, a `GenericRemoteException` is raised instead. Note that the *message* of re-raised exceptions is the concatenation of the original message and the formatted stack trace of the dependency Chainlet. In some cases it might make sense to simply retry a remote invocation (e.g. if it failed due to some transient problems like networking or any "flaky" parts). `depends` can be configured with additional [options](/reference/sdk/chains#truss-chains-depends) for that.
The example below shows how you can add automatic retries and error handling for the call to `TextReplicator` in `TextToNum`: ```python theme={"system"} import truss_chains as chains class TextToNum(chains.ChainletBase): def __init__( self, replicator: TextReplicator = chains.depends(TextReplicator, retries=3), ) -> None: self._replicator = replicator async def run_remote(self, data: ...): try: generated_text = await self._replicator.run_remote(data) except ValueError: ... # Handle error. ``` ## Stack filtering The stack trace is intended to show the user-implemented code in `run_remote` (and user-implemented helper functions). Under the hood, the calls from one Chainlet to another go through an HTTP connection, managed by the Chains framework, and each Chainlet itself runs as a FastAPI server with several layers of request-handling code "above" it. In order to provide concise, readable stacks, all of this non-user code is filtered out. --- # Source: https://docs.baseten.co/inference/output-format/files.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Model I/O with files > Call models by passing a file or URL Baseten supports a wide variety of file-based I/O approaches. These examples show our recommendations for working with files during model inference, whether local or remote, public or private, in the Truss or in your invocation code. ## Files as input ### Example: Send a file with JSON-serializable content The Truss CLI has a `-f` flag to pass file input. If you're using the API endpoint via Python, get file contents with the standard `f.read()` function. ```sh Truss CLI theme={"system"} truss predict -f input.json ``` ```python Python script theme={"system"} import urllib3 import json import os model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] # Read input as JSON with open("input.json", "r") as f: data = json.loads(f.read()) resp = urllib3.request( "POST", # Endpoint for production deployment, see API reference for more f"https://model-{model_id}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json=data ) print(resp.json()) ``` ### Example: Send a file with non-serializable content The `-f` flag for `truss predict` only applies to JSON-serializable content. For other files, like the audio files required by [MusicGen Melody](https://www.baseten.co/library/musicgen-melody), the file content needs to be base64 encoded before it is sent.
```python theme={"system"} import base64 import os import urllib3 model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] # Open a local file with open("mymelody.wav", "rb") as f: # mono wav file, 48khz sample rate # Convert file contents into JSON-serializable format encoded_data = base64.b64encode(f.read()) encoded_str = encoded_data.decode("utf-8") # Define the data payload data = {"prompts": ["happy rock", "energetic EDM", "sad jazz"], "melody": encoded_str, "duration": 8} # Make the POST request resp = urllib3.request( "POST", # Endpoint for production deployment, see API reference for more f"https://model-{model_id}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json=data ) data = resp.json()["data"] # Save output to files for idx, clip in enumerate(data): with open(f"clip_{idx}.wav", "wb") as f: f.write(base64.b64decode(clip)) ``` ### Example: Send a URL to a public file Rather than encoding and serializing a file to send in the HTTP request, you can instead write a Truss that takes a URL as input and loads the content in the `preprocess()` function. Here's an example from [Whisper in the model library](https://www.baseten.co/library/whisper-v3). ```python theme={"system"} from tempfile import NamedTemporaryFile import requests # Get file content without blocking GPU def preprocess(self, request): resp = requests.get(request["url"]) return {"content": resp.content} # Use file content in model inference def predict(self, model_input): with NamedTemporaryFile() as fp: fp.write(model_input["content"]) result = whisper.transcribe( self._model, fp.name, temperature=0, best_of=5, beam_size=5, ) segments = [ {"start": r["start"], "end": r["end"], "text": r["text"]} for r in result["segments"] ] return { "language": whisper.tokenizer.LANGUAGES[result["language"]], "segments": segments, "text": result["text"], } ``` ## Files as output ### Example: Save model output to local file When saving model output to a local file, there's nothing Baseten-specific about the code. Just use the standard `>` operator in bash or `file.write()` function in Python to save the model output. ```sh Truss CLI theme={"system"} truss predict -d '"Model input!"' > output.json ``` ```python Python script theme={"system"} import urllib3 import json import os model_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] # Call model resp = urllib3.request( "POST", # Endpoint for production deployment, see API reference for more f"https://model-{model_id}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json="Model input!" ) # Write results to file with open("output.json", "w") as f: json.dump(resp.json(), f) ``` Output for some models, like image and audio generation models, may need to be decoded before you save it. See our [image generation example](/examples/image-generation) for how to parse base64 output. --- # Source: https://docs.baseten.co/examples/models/flux/flux-schnell.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Flux-Schnell > Flux-Schnell is a state-of-the-art image generation model export const BFLIconCard = ({title, href}) => } horizontal />; ## Example usage The model accepts a `prompt` which is some text describing the image you want to generate.
The output images tend to get better as you add more descriptive words to the prompt. The output JSON object contains a key called `data` which represents the generated image as a base64 string. ### Input ```python theme={"system"} import httpx import os import base64 from PIL import Image from io import BytesIO # Replace the empty string with your model id below model_id = "" baseten_api_key = os.environ["BASETEN_API_KEY"] # Function used to convert a base64 string to a PIL image def b64_to_pil(b64_str): return Image.open(BytesIO(base64.b64decode(b64_str))) data = { "prompt": 'red velvet cake spelling out the words "FLUX SCHNELL", tasty, food photography, dynamic shot' } # Call model endpoint res = httpx.post( f"https://model-{model_id}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {baseten_api_key}"}, json=data ) # Get output image res = res.json() output = res.get("data") # Convert the base64 model output to an image img = b64_to_pil(output) img.save("output_image.jpg") ``` ### JSON output ```json theme={"system"} { "output": "iVBORw0KGgoAAAANSUhEUgAABAAAAAQACAIAAA..." } ``` --- # Source: https://docs.baseten.co/engines/performance-concepts/function-calling.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Function calling > Tool selection and structured function calls with LLMs Function calling is supported by Baseten engines including [BIS-LLM](/engines/bis-llm/overview) and [Engine-Builder-LLM](/engines/engine-builder-llm/overview), as well as [Model APIs](/development/model-apis/overview) for instant access. It's also compatible with other inference frameworks like [vLLM](/examples/vllm) and [SGLang](/examples/sglang). ## Overview *Function calling* (also known as *tool calling*) lets a model **choose a tool and produce arguments** based on a user request. **Important:** the model **does not execute** your Python function. Your application must: 1. run the tool, and 2. optionally send the tool’s output back to the model to produce a final, user-facing response. This is a great fit for [chains](/development/chain/overview) and other orchestrators. *** ## How tool calling works A typical tool-calling loop looks like: 1. **Send** the user message and a list of tools. 2. The model returns either normal text or one or more **tool calls** (name and JSON arguments). 3. **Execute** the tool calls in your application. 4. **Send tool output** back to the model. 5. Receive a **final response** or additional tool calls. *** ## 1. Define tools Tools can be anything: API calls, database queries, internal scripts, etc. Docstrings matter. Models use them to decide which tool to call and how to fill parameters: ```python theme={"system"} def multiply(a: float, b: float): """Multiply two numbers. Args: a: The first number. b: The second number. """ return a * b def divide(a: float, b: float): """Divide two numbers. Args: a: The dividend. b: The divisor (must be non-zero). """ return a / b def add(a: float, b: float): """Add two numbers. Args: a: The first number. b: The second number. """ return a + b def subtract(a: float, b: float): """Subtract two numbers. Args: a: The number to subtract from. b: The number to subtract. """ return a - b ``` ### Tool-writing tips Design small, single-purpose tools and document constraints in docstrings (units, allowed values, required fields). 
Treat model-provided arguments as untrusted input and validate before execution. *** ## 2. Serialize functions Convert functions into JSON-schema tool definitions (OpenAI-compatible format): ```python theme={"system"} from transformers.utils import get_json_schema calculator_functions = { "multiply": multiply, "divide": divide, "add": add, "subtract": subtract, } tools = [get_json_schema(f) for f in calculator_functions.values()] ``` *** ## 3. Call the model Include the `tools` array in your request: ```python theme={"system"} import requests payload = { "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 3.14 + 3.14?"}, ], "tools": tools, "tool_choice": "auto", # default } MODEL_ID = "" BASETEN_API_KEY = "" resp = requests.post( f"https://model-{MODEL_ID}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"}, json=payload, ) ``` *** ## 4. Control tool selection Set `tool_choice` to control how the model uses tools. With `auto` (default), the model may respond with text or tool calls. With `required`, the model must return at least one tool call. With `none`, the model returns plain text only. To force a specific tool: ```python theme={"system"} "tool_choice": {"type": "function", "function": {"name": "subtract"}} ``` *** ## 5. Parse and execute tool calls Depending on the engine and model, tool calls are typically returned in an assistant message under `tool_calls`: ```python theme={"system"} import json data = resp.json() message = data["choices"][0]["message"] tool_calls = message.get("tool_calls") or [] for tool_call in tool_calls: name = tool_call["function"]["name"] args = json.loads(tool_call["function"]["arguments"]) # Validate args in production. result = calculator_functions[name](**args) print(result) ``` ### Full loop: send tool output back for a final answer If you want the model to turn raw tool output into a user-facing response, append the assistant message and a tool response with the matching `tool_call_id`: ```python theme={"system"} # Continue the conversation messages = payload["messages"] messages.append(message) # assistant tool call message # Example: respond to the first tool call tool_call = tool_calls[0] name = tool_call["function"]["name"] args = json.loads(tool_call["function"]["arguments"]) result = calculator_functions[name](**args) messages.append({ "role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps({"result": result}), }) final_payload = { **payload, "messages": messages, } final_resp = requests.post( f"https://model-{MODEL_ID}.api.baseten.co/production/predict", headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"}, json=final_payload, ) print(final_resp.json()["choices"][0]["message"].get("content")) ``` *** ## Practical tips Use low temperature (0.0–0.3) for reliable tool selection and argument values. Add `enum` and `required` constraints in your JSON schema to guide model outputs. Consider parallel tool calls only if your model supports them. Always validate and sanitize inputs before calling real systems. *** ## Further reading * [Chains](/development/chain/overview): Orchestrate multi-step workflows. * [Custom engine builder](/engines/engine-builder-llm/custom-engine-builder): Advanced configuration options. 
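Building on the validation tips above, here is a hedged sketch of checking model-provided arguments against the serialized tool schemas before executing anything. It assumes the `tools` list and `calculator_functions` dict from the earlier steps, and that each serialized entry follows the OpenAI-style `{"type": "function", "function": {"name", "description", "parameters"}}` shape produced by `get_json_schema`:

```python theme={"system"}
import json

# Index the serialized parameter schemas by tool name (tools comes from step 2).
schemas_by_name = {t["function"]["name"]: t["function"]["parameters"] for t in tools}


def validate_and_run(tool_call: dict):
    name = tool_call["function"]["name"]
    if name not in calculator_functions:
        raise ValueError(f"Unknown tool: {name}")

    args = json.loads(tool_call["function"]["arguments"])
    schema = schemas_by_name[name]
    properties = schema.get("properties", {})
    required = schema.get("required", [])

    # Reject missing or unexpected arguments before touching real systems.
    missing = [k for k in required if k not in args]
    unexpected = [k for k in args if k not in properties]
    if missing or unexpected:
        raise ValueError(f"Bad arguments for {name}: missing={missing}, unexpected={unexpected}")

    # Light type check for the numeric calculator tools in this example.
    if not all(isinstance(v, (int, float)) for v in args.values()):
        raise ValueError(f"Non-numeric argument passed to {name}: {args}")

    return calculator_functions[name](**args)


# Usage (replacing the direct call in step 5):
# result = validate_and_run(tool_calls[0])
```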
--- # Source: https://docs.baseten.co/examples/models/gemma/gemma-3-27b-it.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Gemma 3 27B IT > Instruct-tuned open model by Google with excellent ELO/size tradeoff and vision capabilities export const GoogleIconCard = ({title, href}) => } horizontal />; # Example usage Gemma 3 is an OpenAI-compatible model and can be called using the OpenAI SDK in any language. ```python theme={"system"} from openai import OpenAI import os model_url = "" # Copy in from API pane in Baseten model dashboard client = OpenAI( api_key=os.environ['BASETEN_API_KEY'], base_url=model_url ) # Chat completion response_chat = client.chat.completions.create( model="", messages=[{ "role": "user", "content": [ {"type": "text", "text": "What's in this image?"}, { "type": "image_url", "image_url": { "url": "https://picsum.photos/id/237/200/300", }, }, ], }], temperature=0.3, max_tokens=512, ) print(response_chat) ``` **JSON Output** ```json theme={"system"} { "id": "143", "choices": [ { "finish_reason": "stop", "index": 0, "logprobs": null, "message": { "content": "[Model output here]", "role": "assistant", "audio": null, "function_call": null, "tool_calls": null } } ], "created": 1741224586, "model": "", "object": "chat.completion", "service_tier": null, "system_fingerprint": null, "usage": { "completion_tokens": 145, "prompt_tokens": 38, "total_tokens": 183, "completion_tokens_details": null, "prompt_tokens_details": null } } ``` --- # Source: https://docs.baseten.co/reference/management-api/environments/get-a-chain-environments-details.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get Chain environment > Gets a chain environment's details and returns the chain environment. ## OpenAPI ````yaml get /v1/chains/{chain_id}/environments/{env_name} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/environments/{env_name}: parameters: - $ref: '#/components/parameters/chain_id' - $ref: '#/components/parameters/env_name' get: summary: Get a chain environment's details description: Gets a chain environment's details and returns the chain environment. responses: '200': description: Environment for oracles. content: application/json: schema: $ref: '#/components/schemas/ChainEnvironmentV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true env_name: schema: type: string name: env_name in: path required: true schemas: ChainEnvironmentV1: description: Environment for oracles. 
properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment chainlet_settings: description: Environment settings for the chainlets items: $ref: '#/components/schemas/ChainletEnvironmentSettingsV1' title: Chainlet Settings type: array current_deployment: anyOf: - $ref: '#/components/schemas/ChainDeploymentV1' - type: 'null' description: Current chain deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/ChainDeploymentV1' - type: 'null' default: null description: >- Candidate chain deployment being promoted to the environment, if a promotion is in progress required: - name - created_at - chain_id - promotion_settings - chainlet_settings - current_deployment title: ChainEnvironmentV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object ChainletEnvironmentSettingsV1: description: Environment settings for a chainlet. properties: chainlet_name: description: Name of the chainlet title: Chainlet Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the chainlet required: - chainlet_name - autoscaling_settings - instance_type title: ChainletEnvironmentSettingsV1 type: object ChainDeploymentV1: description: A deployment of a chain. 
properties: id: description: Unique identifier of the chain deployment title: Id type: string created_at: description: Time the chain deployment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string environment: anyOf: - type: string - type: 'null' description: Environment the chain deployment is deployed in title: Environment chainlets: description: Chainlets in the chain deployment items: $ref: '#/components/schemas/ChainletV1' title: Chainlets type: array status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chain deployment required: - id - created_at - chain_id - environment - chainlets - status title: ChainDeploymentV1 type: object RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object ChainletV1: description: A chainlet in a chain deployment. properties: id: description: Unique identifier of the chainlet title: Id type: string name: description: Name of the chainlet title: Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type_name: description: Name of the instance type the chainlet is deployed on title: Instance Type Name type: string active_replica_count: description: Number of active replicas title: Active Replica Count type: integer status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chainlet required: - id - name - autoscaling_settings - instance_type_name - active_replica_count - status title: ChainletV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/environments/get-all-chain-environments.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
# Get all Chain environments > Gets all chain environments for a given chain ## OpenAPI ````yaml get /v1/chains/{chain_id}/environments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/environments: parameters: - $ref: '#/components/parameters/chain_id' get: summary: Get all chain environments description: Gets all chain environments for a given chain responses: '200': description: list of environments content: application/json: schema: $ref: '#/components/schemas/EnvironmentsV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true schemas: EnvironmentsV1: description: list of environments properties: environments: items: $ref: '#/components/schemas/EnvironmentV1' title: Environments type: array required: - environments title: EnvironmentsV1 type: object EnvironmentV1: description: Environment for oracles. properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string model_id: description: Unique identifier of the model title: Model Id type: string current_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' description: Current deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' default: null description: >- Candidate deployment being promoted to the environment, if a promotion is in progress autoscaling_settings: $ref: '#/components/schemas/AutoscalingSettingsV1' description: Autoscaling settings for the environment promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the environment required: - name - created_at - model_id - current_deployment - autoscaling_settings - promotion_settings - instance_type title: EnvironmentV1 type: object DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. 
If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/environments/get-all-environments.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
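As a quick usage sketch for the `GET /v1/chains/{chain_id}/environments` endpoint documented above (the chain ID below is a placeholder, and the API key is assumed to live in a `BASETEN_API_KEY` environment variable):

```python theme={"system"}
import os
import requests

chain_id = ""  # placeholder: ID of your chain

# Assumes your API key is stored in the BASETEN_API_KEY environment variable
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://api.baseten.co/v1/chains/{chain_id}/environments",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
resp.raise_for_status()

# The response body matches EnvironmentsV1: {"environments": [...]}
for env in resp.json()["environments"]:
    print(env["name"], env["created_at"])
```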
# Get all environments > Gets all environments for a given model ## OpenAPI ````yaml get /v1/models/{model_id}/environments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments: parameters: - $ref: '#/components/parameters/model_id' get: summary: Get all environments description: Gets all environments for a given model responses: '200': description: list of environments content: application/json: schema: $ref: '#/components/schemas/EnvironmentsV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: EnvironmentsV1: description: list of environments properties: environments: items: $ref: '#/components/schemas/EnvironmentV1' title: Environments type: array required: - environments title: EnvironmentsV1 type: object EnvironmentV1: description: Environment for oracles. properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string model_id: description: Unique identifier of the model title: Model Id type: string current_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' description: Current deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' default: null description: >- Candidate deployment being promoted to the environment, if a promotion is in progress autoscaling_settings: $ref: '#/components/schemas/AutoscalingSettingsV1' description: Autoscaling settings for the environment promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the environment required: - name - created_at - model_id - current_deployment - autoscaling_settings - promotion_settings - instance_type title: EnvironmentV1 type: object DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. 
If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/environments/get-an-environments-details.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get environment > Gets an environment's details and returns the environment. 
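As a usage sketch for the `GET /v1/models/{model_id}/environments/{env_name}` endpoint documented below (the model ID and environment name are placeholders, and the API key is assumed to live in a `BASETEN_API_KEY` environment variable):

```python theme={"system"}
import os
import requests

model_id = ""  # placeholder: ID of your model
env_name = ""  # placeholder: environment name, e.g. "production"

# Assumes your API key is stored in the BASETEN_API_KEY environment variable
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://api.baseten.co/v1/models/{model_id}/environments/{env_name}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
resp.raise_for_status()

# The response body matches EnvironmentV1; current_deployment is null
# until a deployment has been promoted to the environment.
environment = resp.json()
print(environment["name"], environment.get("current_deployment"))
```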
## OpenAPI ````yaml get /v1/models/{model_id}/environments/{env_name} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/environments/{env_name}: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/env_name' get: summary: Get an environment's details description: Gets an environment's details and returns the environment. responses: '200': description: Environment for oracles. content: application/json: schema: $ref: '#/components/schemas/EnvironmentV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true env_name: schema: type: string name: env_name in: path required: true schemas: EnvironmentV1: description: Environment for oracles. properties: name: description: Name of the environment title: Name type: string created_at: description: Time the environment was created in ISO 8601 format format: date-time title: Created At type: string model_id: description: Unique identifier of the model title: Model Id type: string current_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' description: Current deployment of the environment candidate_deployment: anyOf: - $ref: '#/components/schemas/DeploymentV1' - type: 'null' default: null description: >- Candidate deployment being promoted to the environment, if a promotion is in progress autoscaling_settings: $ref: '#/components/schemas/AutoscalingSettingsV1' description: Autoscaling settings for the environment promotion_settings: $ref: '#/components/schemas/PromotionSettingsV1' description: Promotion settings for the environment instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type for the environment required: - name - created_at - model_id - current_deployment - autoscaling_settings - promotion_settings - instance_type title: EnvironmentV1 type: object DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. 
If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object PromotionSettingsV1: description: Promotion settings for promoting chains and oracles properties: redeploy_on_promotion: anyOf: - type: boolean - type: 'null' default: false description: >- Whether to deploy on all promotions. Enabling this flag allows model code to safely handle environment-specific logic. When a deployment is promoted, a new deployment will be created with a copy of the image. examples: - true title: Redeploy On Promotion rolling_deploy: anyOf: - type: boolean - type: 'null' default: false description: Whether the environment should rely on rolling deploy orchestration. examples: - true title: Rolling Deploy rolling_deploy_config: anyOf: - $ref: '#/components/schemas/RollingDeployConfigV1' - type: 'null' default: null description: Rolling deploy configuration for promotions ramp_up_while_promoting: anyOf: - type: boolean - type: 'null' default: false description: Whether to ramp up traffic while promoting examples: - true title: Ramp Up While Promoting ramp_up_duration_seconds: anyOf: - type: integer - type: 'null' default: 600 description: Duration of the ramp up in seconds examples: - 600 title: Ramp Up Duration Seconds title: PromotionSettingsV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string RollingDeployConfigV1: description: Rolling deploy config for promoting chains and oracles properties: rolling_deploy_strategy: $ref: '#/components/schemas/RollingDeployStrategyV1' default: REPLICA description: The rolling deploy strategy to use for promotions. examples: - REPLICA max_surge_percent: default: 20 description: The maximum surge percentage for rolling deploys. examples: - 20 title: Max Surge Percent type: integer max_unavailable_percent: default: 0 description: The maximum unavailable percentage for rolling deploys. examples: - 20 title: Max Unavailable Percent type: integer stabilization_time_seconds: default: 0 description: The stabilization time in seconds for rolling deploys. examples: - 300 title: Stabilization Time Seconds type: integer promotion_cleanup_strategy: $ref: '#/components/schemas/PromotionCleanupStrategyV1' default: SCALE_TO_ZERO description: The promotion cleanup strategy to use for rolling deploys. examples: - SCALE_TO_ZERO title: RollingDeployConfigV1 type: object RollingDeployStrategyV1: description: The rolling deploy strategy. enum: - REPLICA title: RollingDeployStrategyV1 type: string PromotionCleanupStrategyV1: description: The promotion cleanup strategy. enum: - KEEP - SCALE_TO_ZERO title: PromotionCleanupStrategyV1 type: string securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/inference-api/status-endpoints/get-async-request-status.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Async request > Use this endpoint to get the status of an async request. ### Parameters The ID of the model. The ID of the chain. The ID of the async request. ### Headers Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`). ### Response The ID of the async request. The ID of the model that executed the request. The ID of the deployment that executed the request. An enum representing the status of the request. 
Available options: `QUEUED`, `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, `EXPIRED`, `CANCELED`, `WEBHOOK_FAILED` An enum representing the status of sending the predict result to the provided webhook. Available options: `PENDING`, `SUCCEEDED`, `FAILED`, `CANCELED`, `NO_WEBHOOK_PROVIDED` The time in UTC at which the async request was created. The time in UTC at which the async request's status was updated. Any errors that occurred in processing the async request. Empty if no errors occurred. An enum representing the type of error that occurred. Available options: `MODEL_PREDICT_ERROR`, `MODEL_PREDICT_TIMEOUT`, `MODEL_NOT_READY`, `MODEL_DOES_NOT_EXIST`, `MODEL_UNAVAILABLE`, `MODEL_INVALID_INPUT`, `ASYNC_REQUEST_NOT_SUPPORTED`, `INTERNAL_SERVER_ERROR` A message containing details of the error that occurred. The ID of the async request. The ID of the chain that executed the request. The ID of the deployment that executed the request. An enum representing the status of the request. Available options: `QUEUED`, `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, `EXPIRED`, `CANCELED`, `WEBHOOK_FAILED` An enum representing the status of sending the predict result to the provided webhook. Available options: `PENDING`, `SUCCEEDED`, `FAILED`, `CANCELED`, `NO_WEBHOOK_PROVIDED` The time in UTC at which the async request was created. The time in UTC at which the async request's status was updated. Any errors that occurred in processing the async request. Empty if no errors occurred. An enum representing the type of error that occurred. Available options: `MODEL_PREDICT_ERROR`, `MODEL_PREDICT_TIMEOUT`, `MODEL_NOT_READY`, `MODEL_DOES_NOT_EXIST`, `MODEL_UNAVAILABLE`, `MODEL_INVALID_INPUT`, `ASYNC_REQUEST_NOT_SUPPORTED`, `INTERNAL_SERVER_ERROR` A message containing details of the error that occurred. ```json 200 (Model) theme={"system"} { "request_id": "", "model_id": "", "deployment_id": "", "status": "", "webhook_status": "", "created_at": "", "status_at": "", "errors": [ { "code": "", "message": "" } ] } ``` ```json 200 (Chain) theme={"system"} { "request_id": "", "chain_id": "", "deployment_id": "", "status": "", "webhook_status": "", "created_at": "", "status_at": "", "errors": [ { "code": "", "message": "" } ] } ``` ### Rate limits Calls to the get async request status endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code. To avoid hitting this rate limit, we recommend [configuring a webhook endpoint](/inference/async#configuring-the-webhook-endpoint) to receive async predict results instead of frequently polling this endpoint for async request statuses. 
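If you do poll, keep your request rate well under this limit. The sketch below builds on the single status request shown after it; the model ID and request ID are placeholders, the API key is read from a `BASETEN_API_KEY` environment variable, and the five-second interval is an arbitrary choice for illustration.

```python theme={"system"}
import os
import time

import requests

model_id = ""    # placeholder: ID of your model
request_id = ""  # placeholder: ID of the async request

# Assumes your API key is stored in the BASETEN_API_KEY environment variable
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Statuses after which the request will no longer change
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "EXPIRED", "CANCELED", "WEBHOOK_FAILED"}

while True:
    resp = requests.get(
        f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
    )
    status = resp.json()["status"]
    print(f"Async request status: {status}")
    if status in TERMINAL_STATUSES:
        break
    # Poll infrequently to stay well under the 20 requests/second limit;
    # prefer a webhook for anything beyond quick experiments.
    time.sleep(5)
```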
```python Python (Model) theme={"system"} import requests import os model_id = "" request_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://model-{model_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` ```python Python (Chain) theme={"system"} import requests import os chain_id = "" request_id = "" # Read secrets from environment variables baseten_api_key = os.environ["BASETEN_API_KEY"] resp = requests.get( f"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}", headers={"Authorization": f"Api-Key {baseten_api_key}"} ) print(resp.json()) ``` --- # Source: https://docs.baseten.co/reference/training-api/get-training-job-checkpoint-files.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get training job checkpoint files > Get presigned URLs for all checkpoint files for a training job. ## OpenAPI ````yaml get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoint_files openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoint_files: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' get: summary: Get training job checkpoint files. description: Get presigned URLs for all checkpoint files for a training job. responses: '200': description: >- A response to fetch presigned URLs for checkpoint files of a training job. content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobCheckpointFilesResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: GetTrainingJobCheckpointFilesResponseV1: description: >- A response to fetch presigned URLs for checkpoint files of a training job. properties: presigned_urls: description: List of presigned URLs for checkpoint files. items: $ref: '#/components/schemas/CheckpointFile' title: Presigned Urls type: array next_page_token: anyOf: - type: integer - type: 'null' default: null description: >- Token to use for fetching the next page of results. None when there are no more results. title: Next Page Token total_count: description: Total number of checkpoint files available. title: Total Count type: integer required: - presigned_urls - total_count title: GetTrainingJobCheckpointFilesResponseV1 type: object CheckpointFile: properties: url: title: Url type: string relative_file_name: title: Relative File Name type: string node_rank: title: Node Rank type: integer size_bytes: title: Size Bytes type: integer last_modified: title: Last Modified type: string required: - url - relative_file_name - node_rank - size_bytes - last_modified title: CheckpointFile type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/get-training-job-checkpoints.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # List training job checkpoints > Get the checkpoints for a training job. ## OpenAPI ````yaml get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoints openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoints: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' get: summary: Get training job checkpoints. description: Get the checkpoints for a training job. responses: '200': description: A response to fetch checkpoints for a training job. content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobCheckpointsResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: GetTrainingJobCheckpointsResponseV1: description: A response to fetch checkpoints for a training job. properties: training_job: $ref: '#/components/schemas/TrainingJobV1' description: The training job. checkpoints: description: The checkpoints for the training job. items: $ref: '#/components/schemas/TrainingJobCheckpointV1' title: Checkpoints type: array required: - training_job - checkpoints title: GetTrainingJobCheckpointsResponseV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object TrainingJobCheckpointV1: description: A checkpoint for a training job. properties: training_job_id: description: The ID of the training job. title: Training Job Id type: string checkpoint_id: description: The ID of the checkpoint. title: Checkpoint Id type: string created_at: description: The timestamp of the checkpoint in ISO 8601 format. format: date-time title: Created At type: string checkpoint_type: description: The type of checkpoint. 
title: Checkpoint Type type: string base_model: anyOf: - type: string - type: 'null' description: The base model of the checkpoint. title: Base Model lora_adapter_config: anyOf: - additionalProperties: true type: object - type: 'null' description: The adapter config of the checkpoint. title: Lora Adapter Config size_bytes: description: The size of the checkpoint in bytes. title: Size Bytes type: integer required: - training_job_id - checkpoint_id - created_at - checkpoint_type - base_model - lora_adapter_config - size_bytes title: TrainingJobCheckpointV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/get-training-job-logs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get training job logs > Get the logs for a training job with the provided filters. ## OpenAPI ````yaml post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/logs openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}/logs: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' post: summary: Get the logs for a training job. description: Get the logs for a training job with the provided filters. requestBody: content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobLogsRequestV1' required: true responses: '200': description: A response to querying logs. content: application/json: schema: $ref: '#/components/schemas/GetLogsResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: GetTrainingJobLogsRequestV1: description: A request to fetch training logs. 
properties: start_epoch_millis: anyOf: - type: integer - type: 'null' default: null description: Epoch millis timestamp to start fetching logs title: Start Epoch Millis end_epoch_millis: anyOf: - type: integer - type: 'null' default: null description: Epoch millis timestamp to end fetching logs title: End Epoch Millis direction: anyOf: - $ref: '#/components/schemas/SortOrderV1' - type: 'null' default: null description: Sort order for logs limit: anyOf: - maximum: 1000 minimum: 1 type: integer - type: 'null' default: 500 description: Limit of logs to fetch in a single request title: Limit title: GetTrainingJobLogsRequestV1 type: object GetLogsResponseV1: description: A response to querying logs. properties: logs: description: Logs for a specific entity. items: $ref: '#/components/schemas/LogV1' title: Logs type: array required: - logs title: GetLogsResponseV1 type: object SortOrderV1: enum: - asc - desc title: SortOrderV1 type: string LogV1: properties: timestamp: description: Epoch nanosecond timestamp of the log message. title: Timestamp type: string message: description: The contents of the log message. title: Message type: string replica: anyOf: - type: string - type: 'null' description: The replica the log line was emitted from. title: Replica required: - timestamp - message - replica title: LogV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/get-training-job-metrics.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get training job metrics > Get the metrics for a training job. ## OpenAPI ````yaml post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/metrics openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}/metrics: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' post: summary: Get the metrics for a training job. description: Get the metrics for a training job. requestBody: content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobMetricsRequestV1' required: true responses: '200': description: >- A response to fetch training job metrics. The outer list for each metric represents that metric across time. content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobMetricsResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: GetTrainingJobMetricsRequestV1: description: >- A request to fetch metrics. Allows the user to request metrics over a period of time. properties: end_epoch_millis: anyOf: - type: integer - type: 'null' default: null description: Epoch millis timestamp to end fetching metrics title: End Epoch Millis start_epoch_millis: anyOf: - type: integer - type: 'null' default: null description: Epoch millis timestamp to start fetching metrics. 
title: Start Epoch Millis title: GetTrainingJobMetricsRequestV1 type: object GetTrainingJobMetricsResponseV1: description: >- A response to fetch training job metrics. The outer list for each metric represents that metric across time. properties: gpu_memory_usage_bytes: additionalProperties: items: $ref: '#/components/schemas/TrainingJobMetricV1' type: array description: >- A map of GPU rank to memory usage for the training job. For multinode jobs, this is the memory usage of the leader unless specified otherwise. title: Gpu Memory Usage Bytes type: object gpu_utilization: additionalProperties: items: $ref: '#/components/schemas/TrainingJobMetricV1' type: array description: >- A map of GPU rank to fractional GPU utilization. For multinode jobs, this is the GPU utilization of the leader unless specified otherwise. title: Gpu Utilization type: object cpu_usage: description: >- The CPU usage measured in cores. For multinode jobs, this is the CPU usage of the leader unless specified otherwise. items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Cpu Usage type: array cpu_memory_usage_bytes: description: >- The CPU memory usage for the training job. For multinode jobs, this is the CPU memory usage of the leader unless specified otherwise. items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Cpu Memory Usage Bytes type: array ephemeral_storage: $ref: '#/components/schemas/StorageMetricsV1' description: >- The storage usage for the ephemeral storage. For multinode jobs, this is the ephemeral storage usage of the leader unless specified otherwise. training_job: $ref: '#/components/schemas/TrainingJobV1' description: The training job. cache: anyOf: - $ref: '#/components/schemas/StorageMetricsV1' - type: 'null' description: The storage usage for the read-write cache. per_node_metrics: description: The metrics for each node in the training job. items: $ref: '#/components/schemas/TrainingJobNodeMetricsV1' title: Per Node Metrics type: array required: - gpu_memory_usage_bytes - gpu_utilization - cpu_usage - cpu_memory_usage_bytes - ephemeral_storage - training_job - cache - per_node_metrics title: GetTrainingJobMetricsResponseV1 type: object TrainingJobMetricV1: description: A metric for a training job. properties: value: description: The value of the metric. title: Value type: number timestamp: description: The timestamp of the metric in ISO 8601 format. format: date-time title: Timestamp type: string required: - value - timestamp title: TrainingJobMetricV1 type: object StorageMetricsV1: description: A metric for a training job. properties: usage_bytes: description: The number of bytes used on the storage entity. items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Usage Bytes type: array utilization: description: The utilization of the storage entity as a decimal percentage. items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Utilization type: array required: - usage_bytes - utilization title: StorageMetricsV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. 
title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object TrainingJobNodeMetricsV1: description: A set of metrics for a training job node. properties: node_id: description: The name of the node. title: Node Id type: string metrics: $ref: '#/components/schemas/TrainingJobMetricsV1' description: The metrics for the node. required: - node_id - metrics title: TrainingJobNodeMetricsV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object TrainingJobMetricsV1: properties: gpu_memory_usage_bytes: additionalProperties: items: $ref: '#/components/schemas/TrainingJobMetricV1' type: array description: >- A map of GPU rank to memory usage for the training job. For multinode jobs, this is the memory usage of the leader unless specified otherwise. title: Gpu Memory Usage Bytes type: object gpu_utilization: additionalProperties: items: $ref: '#/components/schemas/TrainingJobMetricV1' type: array description: >- A map of GPU rank to fractional GPU utilization. For multinode jobs, this is the GPU utilization of the leader unless specified otherwise. title: Gpu Utilization type: object cpu_usage: description: >- The CPU usage measured in cores. For multinode jobs, this is the CPU usage of the leader unless specified otherwise. items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Cpu Usage type: array cpu_memory_usage_bytes: description: >- The CPU memory usage for the training job. For multinode jobs, this is the CPU memory usage of the leader unless specified otherwise. 
items: $ref: '#/components/schemas/TrainingJobMetricV1' title: Cpu Memory Usage Bytes type: array ephemeral_storage: $ref: '#/components/schemas/StorageMetricsV1' description: >- The storage usage for the ephemeral storage. For multinode jobs, this is the ephemeral storage usage of the leader unless specified otherwise. required: - gpu_memory_usage_bytes - gpu_utilization - cpu_usage - cpu_memory_usage_bytes - ephemeral_storage title: TrainingJobMetricsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/get-training-job.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get training job > Get the details of an existing training job. ## OpenAPI ````yaml get /v1/training_projects/{training_project_id}/jobs/{training_job_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects/{training_project_id}/jobs/{training_job_id}: parameters: - $ref: '#/components/parameters/training_project_id' - $ref: '#/components/parameters/training_job_id' get: summary: Get a training job. description: Get the details of an existing training job. responses: '200': description: A response to fetch a training job. content: application/json: schema: $ref: '#/components/schemas/GetTrainingJobResponseV1' components: parameters: training_project_id: schema: type: string name: training_project_id in: path required: true training_job_id: schema: type: string name: training_job_id in: path required: true schemas: GetTrainingJobResponseV1: description: A response to fetch a training job. properties: training_project: $ref: '#/components/schemas/TrainingProjectV1' description: The training project. training_job: $ref: '#/components/schemas/TrainingJobV1' description: The fetched training job. required: - training_project - training_job title: GetTrainingJobResponseV1 type: object TrainingProjectV1: properties: id: description: Unique identifier of the training project title: Id type: string name: description: Name of the training project. title: Name type: string created_at: description: Time the training project was created in ISO 8601 format. format: date-time title: Created At type: string updated_at: description: Time the training project was updated in ISO 8601 format. format: date-time title: Updated At type: string team_name: anyOf: - type: string - type: 'null' default: null description: Name of the team associated with the training project. title: Team Name latest_job: anyOf: - $ref: '#/components/schemas/TrainingJobV1' - type: 'null' description: Most recently created training job for the training project. required: - id - name - created_at - updated_at - latest_job title: TrainingProjectV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. 
title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/training-api/get-training-projects.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # List training projects > List all training projects for the organization. ## OpenAPI ````yaml get /v1/training_projects openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/training_projects: get: summary: List training projects. description: List all training projects for the organization. responses: '200': description: A response to list training projects. content: application/json: schema: $ref: '#/components/schemas/ListTrainingProjectsResponseV1' components: schemas: ListTrainingProjectsResponseV1: description: A response to list training projects. properties: training_projects: description: List of training projects. 
items: $ref: '#/components/schemas/TrainingProjectV1' title: Training Projects type: array required: - training_projects title: ListTrainingProjectsResponseV1 type: object TrainingProjectV1: properties: id: description: Unique identifier of the training project title: Id type: string name: description: Name of the training project. title: Name type: string created_at: description: Time the training project was created in ISO 8601 format. format: date-time title: Created At type: string updated_at: description: Time the training project was updated in ISO 8601 format. format: date-time title: Updated At type: string team_name: anyOf: - type: string - type: 'null' default: null description: Name of the team associated with the training project. title: Team Name latest_job: anyOf: - $ref: '#/components/schemas/TrainingJobV1' - type: 'null' description: Most recently created training job for the training project. required: - id - name - created_at - updated_at - latest_job title: TrainingProjectV1 type: object TrainingJobV1: properties: id: description: Unique identifier of the training job. title: Id type: string created_at: description: Time the job was created in ISO 8601 format. format: date-time title: Created At type: string current_status: description: Current status of the training job. title: Current Status type: string error_message: anyOf: - type: string - type: 'null' default: null description: Error message if the training job failed. title: Error Message instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type of the training job. updated_at: description: Time the job was updated in ISO 8601 format. format: date-time title: Updated At type: string training_project_id: description: ID of the training project. title: Training Project Id type: string training_project: $ref: '#/components/schemas/TrainingProjectSummaryV1' description: Summary of the training project. name: anyOf: - type: string - type: 'null' default: null description: Name of the training job. examples: - gpt-oss-job title: Name required: - id - created_at - current_status - instance_type - updated_at - training_project_id - training_project title: TrainingJobV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object TrainingProjectSummaryV1: description: A summary of a training project. properties: id: description: Unique identifier of the training project. title: Id type: string name: description: Name of the training project. 
title: Name type: string required: - id - name title: TrainingProjectSummaryV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/chains/gets-a-chain-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # By ID ## OpenAPI ````yaml get /v1/chains/{chain_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}: parameters: - $ref: '#/components/parameters/chain_id' get: summary: Gets a chain by ID responses: '200': description: A chain. content: application/json: schema: $ref: '#/components/schemas/ChainV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true schemas: ChainV1: description: A chain. properties: id: description: Unique identifier of the chain title: Id type: string created_at: description: Time the chain was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the chain title: Name type: string deployments_count: description: Number of deployments of the chain title: Deployments Count type: integer team_name: description: Name of the team associated with the chain title: Team Name type: string required: - id - created_at - name - deployments_count - team_name title: ChainV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-chain-deployment-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Any chain deployment by ID ## OpenAPI ````yaml get /v1/chains/{chain_id}/deployments/{chain_deployment_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/deployments/{chain_deployment_id}: parameters: - $ref: '#/components/parameters/chain_id' - $ref: '#/components/parameters/chain_deployment_id' get: summary: Gets a chain deployment by ID responses: '200': description: A deployment of a chain. content: application/json: schema: $ref: '#/components/schemas/ChainDeploymentV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true chain_deployment_id: schema: type: string name: chain_deployment_id in: path required: true schemas: ChainDeploymentV1: description: A deployment of a chain. 
properties: id: description: Unique identifier of the chain deployment title: Id type: string created_at: description: Time the chain deployment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string environment: anyOf: - type: string - type: 'null' description: Environment the chain deployment is deployed in title: Environment chainlets: description: Chainlets in the chain deployment items: $ref: '#/components/schemas/ChainletV1' title: Chainlets type: array status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chain deployment required: - id - created_at - chain_id - environment - chainlets - status title: ChainDeploymentV1 type: object ChainletV1: description: A chainlet in a chain deployment. properties: id: description: Unique identifier of the chainlet title: Id type: string name: description: Name of the chainlet title: Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type_name: description: Name of the instance type the chainlet is deployed on title: Instance Type Name type: string active_replica_count: description: Number of active replicas title: Active Replica Count type: integer status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chainlet required: - id - name - autoscaling_settings - instance_type_name - active_replica_count - status title: ChainletV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/models/gets-a-model-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # By ID ## OpenAPI ````yaml get /v1/models/{model_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}: parameters: - $ref: '#/components/parameters/model_id' get: summary: Gets a model by ID responses: '200': description: A model. content: application/json: schema: $ref: '#/components/schemas/ModelV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: ModelV1: description: A model. properties: id: description: Unique identifier of the model title: Id type: string created_at: description: Time the model was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the model title: Name type: string deployments_count: description: Number of deployments of the model title: Deployments Count type: integer production_deployment_id: anyOf: - type: string - type: 'null' description: Unique identifier of the production deployment of the model title: Production Deployment Id development_deployment_id: anyOf: - type: string - type: 'null' description: Unique identifier of the development deployment of the model title: Development Deployment Id instance_type_name: description: Name of the instance type for the production deployment of the model title: Instance Type Name type: string team_name: description: Name of the team associated with the model. title: Team Name type: string required: - id - created_at - name - deployments_count - production_deployment_id - development_deployment_id - instance_type_name - team_name title: ModelV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-deployment-by-id.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Any model deployment by ID > Gets a model's deployment by ID and returns the deployment. ## OpenAPI ````yaml get /v1/models/{model_id}/deployments/{deployment_id} openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/{deployment_id}: parameters: - $ref: '#/components/parameters/model_id' - $ref: '#/components/parameters/deployment_id' get: summary: Gets a model's deployment by ID description: Gets a model's deployment by ID and returns the deployment. responses: '200': description: A deployment of a model. content: application/json: schema: $ref: '#/components/schemas/DeploymentV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true deployment_id: schema: type: string name: deployment_id in: path required: true schemas: DeploymentV1: description: A deployment of a model. 
properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-development-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
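As a quick illustration of the calling convention shared by the deployment endpoint above and the development- and production-deployment variants that follow, here is a minimal Python sketch. The model and deployment IDs are placeholders, and the field access simply follows the `DeploymentV1` schema documented above; this is a sketch, not part of the official reference.

```python theme={"system"}
import os
import requests

# Hypothetical IDs; substitute real values from your Baseten workspace.
MODEL_ID = "abcd1234"
DEPLOYMENT_ID = "wxyz5678"

resp = requests.get(
    f"https://api.baseten.co/v1/models/{MODEL_ID}/deployments/{DEPLOYMENT_ID}",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
)
resp.raise_for_status()
deployment = resp.json()

# Field names follow the DeploymentV1 schema documented above.
print(deployment["status"], deployment["active_replica_count"])
```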
# Development model deployment > Gets a model's development deployment and returns the deployment. ## OpenAPI ````yaml get /v1/models/{model_id}/deployments/development openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/development: parameters: - $ref: '#/components/parameters/model_id' get: summary: Gets a model's development deployment description: Gets a model's development deployment and returns the deployment. responses: '200': description: A deployment of a model. content: application/json: schema: $ref: '#/components/schemas/DeploymentV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. 
title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-production-deployment.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Production model deployment > Gets a model's production deployment and returns the deployment. ## OpenAPI ````yaml get /v1/models/{model_id}/deployments/production openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments/production: parameters: - $ref: '#/components/parameters/model_id' get: summary: Gets a model's production deployment description: Gets a model's production deployment and returns the deployment. responses: '200': description: A deployment of a model. content: application/json: schema: $ref: '#/components/schemas/DeploymentV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: DeploymentV1: description: A deployment of a model. properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object DeploymentStatusV1: description: The status of a deployment. 
enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-all-chain-deployments.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get all chain deployments ## OpenAPI ````yaml get /v1/chains/{chain_id}/deployments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains/{chain_id}/deployments: parameters: - $ref: '#/components/parameters/chain_id' get: summary: Gets all chain deployments responses: '200': description: A list of chain deployments. content: application/json: schema: $ref: '#/components/schemas/ChainDeploymentsV1' components: parameters: chain_id: schema: type: string name: chain_id in: path required: true schemas: ChainDeploymentsV1: description: A list of chain deployments. properties: deployments: description: A list of chain deployments items: $ref: '#/components/schemas/ChainDeploymentV1' title: Deployments type: array required: - deployments title: ChainDeploymentsV1 type: object ChainDeploymentV1: description: A deployment of a chain. 
properties: id: description: Unique identifier of the chain deployment title: Id type: string created_at: description: Time the chain deployment was created in ISO 8601 format format: date-time title: Created At type: string chain_id: description: Unique identifier of the chain title: Chain Id type: string environment: anyOf: - type: string - type: 'null' description: Environment the chain deployment is deployed in title: Environment chainlets: description: Chainlets in the chain deployment items: $ref: '#/components/schemas/ChainletV1' title: Chainlets type: array status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chain deployment required: - id - created_at - chain_id - environment - chainlets - status title: ChainDeploymentV1 type: object ChainletV1: description: A chainlet in a chain deployment. properties: id: description: Unique identifier of the chainlet title: Id type: string name: description: Name of the chainlet title: Name type: string autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the chainlet. If null, it has not finished deploying instance_type_name: description: Name of the instance type the chainlet is deployed on title: Instance Type Name type: string active_replica_count: description: Number of active replicas title: Active Replica Count type: integer status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the chainlet required: - id - name - autoscaling_settings - instance_type_name - active_replica_count - status title: ChainletV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. 
For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/chains/gets-all-chains.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # All chains ## OpenAPI ````yaml get /v1/chains openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/chains: get: summary: Gets all chains responses: '200': description: A list of chains. content: application/json: schema: $ref: '#/components/schemas/ChainsV1' components: schemas: ChainsV1: description: A list of chains. properties: chains: items: $ref: '#/components/schemas/ChainV1' title: Chains type: array required: - chains title: ChainsV1 type: object ChainV1: description: A chain. properties: id: description: Unique identifier of the chain title: Id type: string created_at: description: Time the chain was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the chain title: Name type: string deployments_count: description: Number of deployments of the chain title: Deployments Count type: integer team_name: description: Name of the team associated with the chain title: Team Name type: string required: - id - created_at - name - deployments_count - team_name title: ChainV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/deployments/gets-all-deployments-of-a-model.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get all model deployments ## OpenAPI ````yaml get /v1/models/{model_id}/deployments openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models/{model_id}/deployments: parameters: - $ref: '#/components/parameters/model_id' get: summary: Gets all deployments of a model responses: '200': description: A list of deployments of a model. content: application/json: schema: $ref: '#/components/schemas/DeploymentsV1' components: parameters: model_id: schema: type: string name: model_id in: path required: true schemas: DeploymentsV1: description: A list of deployments of a model. properties: deployments: description: A list of deployments of a model items: $ref: '#/components/schemas/DeploymentV1' title: Deployments type: array required: - deployments title: DeploymentsV1 type: object DeploymentV1: description: A deployment of a model. 
properties: id: description: Unique identifier of the deployment title: Id type: string created_at: description: Time the deployment was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the deployment title: Name type: string model_id: description: Unique identifier of the model title: Model Id type: string is_production: description: Whether the deployment is the production deployment of the model title: Is Production type: boolean is_development: description: Whether the deployment is the development deployment of the model title: Is Development type: boolean status: $ref: '#/components/schemas/DeploymentStatusV1' description: Status of the deployment active_replica_count: description: Number of active replicas title: Active Replica Count type: integer autoscaling_settings: anyOf: - $ref: '#/components/schemas/AutoscalingSettingsV1' - type: 'null' description: >- Autoscaling settings for the deployment. If null, the model has not finished deploying instance_type_name: anyOf: - type: string - type: 'null' description: Name of the instance type the model deployment is running on title: Instance Type Name environment: anyOf: - type: string - type: 'null' description: The environment associated with the deployment title: Environment required: - id - created_at - name - model_id - is_production - is_development - status - active_replica_count - autoscaling_settings - instance_type_name - environment title: DeploymentV1 type: object DeploymentStatusV1: description: The status of a deployment. enum: - BUILDING - DEPLOYING - DEPLOY_FAILED - LOADING_MODEL - ACTIVE - UNHEALTHY - BUILD_FAILED - BUILD_STOPPED - DEACTIVATING - INACTIVE - FAILED - UPDATING - SCALED_TO_ZERO - WAKING_UP title: DeploymentStatusV1 type: string AutoscalingSettingsV1: description: Autoscaling settings for a deployment. properties: min_replica: description: Minimum number of replicas title: Min Replica type: integer max_replica: description: Maximum number of replicas title: Max Replica type: integer autoscaling_window: anyOf: - type: integer - type: 'null' description: Timeframe of traffic considered for autoscaling decisions title: Autoscaling Window scale_down_delay: anyOf: - type: integer - type: 'null' description: Waiting period before scaling down any active replica title: Scale Down Delay concurrency_target: description: Number of requests per replica before scaling up title: Concurrency Target type: integer target_utilization_percentage: anyOf: - type: integer - type: 'null' description: Target utilization percentage for scaling up/down. title: Target Utilization Percentage target_in_flight_tokens: anyOf: - type: integer - type: 'null' default: null description: >- Target number of in-flight tokens for autoscaling decisions. Early access only. title: Target In Flight Tokens required: - min_replica - max_replica - autoscaling_window - scale_down_delay - concurrency_target - target_utilization_percentage title: AutoscalingSettingsV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/instance-types/gets-all-instance-types.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
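The instance-type endpoints documented next are plain authenticated GETs as well. A minimal sketch of listing instance types and their GPU configuration might look like the following; it assumes `BASETEN_API_KEY` is set in the environment, and the field names are taken from the `InstanceTypesV1`/`InstanceTypeV1` schemas below.

```python theme={"system"}
import os
import requests

resp = requests.get(
    "https://api.baseten.co/v1/instance_types",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
)
resp.raise_for_status()

# The response wraps the list in an "instance_types" key (InstanceTypesV1).
for it in resp.json()["instance_types"]:
    print(it["name"], it["gpu_count"], it["gpu_type"])
```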
# All instance types ## OpenAPI ````yaml get /v1/instance_types openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/instance_types: get: summary: Gets all available instance types responses: '200': description: A list of instance types. content: application/json: schema: $ref: '#/components/schemas/InstanceTypesV1' components: schemas: InstanceTypesV1: description: A list of instance types. properties: instance_types: items: $ref: '#/components/schemas/InstanceTypeV1' title: Instance Types type: array required: - instance_types title: InstanceTypesV1 type: object InstanceTypeV1: description: An instance type. properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/models/gets-all-models.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # All models ## OpenAPI ````yaml get /v1/models openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/models: get: summary: Gets all models responses: '200': description: A list of models. content: application/json: schema: $ref: '#/components/schemas/ModelsV1' components: schemas: ModelsV1: description: A list of models. properties: models: items: $ref: '#/components/schemas/ModelV1' title: Models type: array required: - models title: ModelsV1 type: object ModelV1: description: A model. 
properties: id: description: Unique identifier of the model title: Id type: string created_at: description: Time the model was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the model title: Name type: string deployments_count: description: Number of deployments of the model title: Deployments Count type: integer production_deployment_id: anyOf: - type: string - type: 'null' description: Unique identifier of the production deployment of the model title: Production Deployment Id development_deployment_id: anyOf: - type: string - type: 'null' description: Unique identifier of the development deployment of the model title: Development Deployment Id instance_type_name: description: Name of the instance type for the production deployment of the model title: Instance Type Name type: string team_name: description: Name of the team associated with the model. title: Team Name type: string required: - id - created_at - name - deployments_count - production_deployment_id - development_deployment_id - instance_type_name - team_name title: ModelV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/secrets/gets-all-secrets.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Get all secrets ## OpenAPI ````yaml get /v1/secrets openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/secrets: get: summary: Gets all secrets responses: '200': description: A list of Baseten secrets. content: application/json: schema: $ref: '#/components/schemas/SecretsV1' components: schemas: SecretsV1: description: A list of Baseten secrets. properties: secrets: items: $ref: '#/components/schemas/SecretV1' title: Secrets type: array required: - secrets title: SecretsV1 type: object SecretV1: description: A Baseten secret. Note that we do not support retrieving secret values. properties: created_at: description: Time the secret was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the secret title: Name type: string team_name: description: Name of the team the secret belongs to title: Team Name type: string required: - created_at - name - team_name title: SecretV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/teams/gets-all-team-secrets.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. 
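Both the organization-wide secrets endpoint above and the team-scoped variant below return `SecretV1` objects, which expose only names and metadata (secret values are never returned). A minimal sketch, assuming a valid API key in `BASETEN_API_KEY` and a placeholder team ID for the team-scoped call:

```python theme={"system"}
import os
import requests

HEADERS = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
BASE = "https://api.baseten.co"

# Organization-wide secrets (names and metadata only).
for secret in requests.get(f"{BASE}/v1/secrets", headers=HEADERS).json()["secrets"]:
    print(secret["name"], secret["team_name"], secret["created_at"])

# Team-scoped secrets; TEAM_ID is a placeholder for a real team identifier.
TEAM_ID = "my-team-id"
team_secrets = requests.get(f"{BASE}/v1/teams/{TEAM_ID}/secrets", headers=HEADERS).json()
print([s["name"] for s in team_secrets["secrets"]])
```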
# Get all team secrets ## OpenAPI ````yaml get /v1/teams/{team_id}/secrets openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/teams/{team_id}/secrets: parameters: - $ref: '#/components/parameters/team_id' get: summary: Gets all secrets for a team responses: '200': description: A list of Baseten secrets. content: application/json: schema: $ref: '#/components/schemas/SecretsV1' components: parameters: team_id: schema: type: string name: team_id in: path required: true schemas: SecretsV1: description: A list of Baseten secrets. properties: secrets: items: $ref: '#/components/schemas/SecretV1' title: Secrets type: array required: - secrets title: SecretsV1 type: object SecretV1: description: A Baseten secret. Note that we do not support retrieving secret values. properties: created_at: description: Time the secret was created in ISO 8601 format format: date-time title: Created At type: string name: description: Name of the secret title: Name type: string team_name: description: Name of the team the secret belongs to title: Team Name type: string required: - created_at - name - team_name title: SecretV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/reference/management-api/instance-types/gets-instance-type-prices.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Instance type prices ## OpenAPI ````yaml get /v1/instance_type_prices openapi: 3.1.0 info: description: REST API for management of Baseten resources title: Baseten management API version: 1.0.0 servers: - url: https://api.baseten.co security: - ApiKeyAuth: [] paths: /v1/instance_type_prices: get: summary: Gets prices for available instance types. responses: '200': description: A list of instance types. content: application/json: schema: $ref: '#/components/schemas/InstanceTypePricesV1' components: schemas: InstanceTypePricesV1: description: A list of instance types. properties: instance_types: items: $ref: '#/components/schemas/InstanceTypeWithPriceV1' title: Instance Types type: array required: - instance_types title: InstanceTypePricesV1 type: object InstanceTypeWithPriceV1: properties: instance_type: $ref: '#/components/schemas/InstanceTypeV1' description: Instance type properties. price: description: Usage price in USD / minute. title: Price type: number required: - instance_type - price title: InstanceTypeWithPriceV1 type: object InstanceTypeV1: description: An instance type. 
properties: id: description: Identifier string for the instance type title: Id type: string name: description: Display name of the instance type title: Name type: string memory_limit_mib: description: Memory limit of the instance type in Mebibytes title: Memory Limit Mib type: integer millicpu_limit: description: CPU limit of the instance type in millicpu title: Millicpu Limit type: integer gpu_count: description: Number of GPUs on the instance type title: Gpu Count type: integer gpu_type: anyOf: - type: string - type: 'null' description: Type of GPU on the instance type title: Gpu Type gpu_memory_limit_mib: anyOf: - type: integer - type: 'null' description: Memory limit of the GPU on the instance type in Mebibytes title: Gpu Memory Limit Mib required: - id - name - memory_limit_mib - millicpu_limit - gpu_count - gpu_type - gpu_memory_limit_mib title: InstanceTypeV1 type: object securitySchemes: ApiKeyAuth: type: apiKey in: header name: Authorization description: >- You must specify the scheme 'Api-Key' in the Authorization header. For example, `Authorization: Api-Key ` ```` --- # Source: https://docs.baseten.co/training/getting-started.md # Source: https://docs.baseten.co/development/chain/getting-started.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Your first Chain > Build and deploy two example Chains This quickstart guide contains instructions for creating two Chains: 1. A simple CPU-only “hello world” Chain. 2. A Chain that implements Phi-3 Mini and uses it to write poems. ## Prerequisites To use Chains, install a recent Truss version and ensure pydantic is v2: ```bash theme={"system"} pip install --upgrade truss 'pydantic>=2.0.0' ``` Truss requires python `>=3.9,<3.15`. To set up a fresh development environment, you can use the following commands, creating an environment named `chains_env` using `pyenv`: ```bash theme={"system"} curl https://pyenv.run | bash echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc echo 'eval "$(pyenv init -)"' >> ~/.bashrc source ~/.bashrc pyenv install 3.11.0 ENV_NAME="chains_env" pyenv virtualenv 3.11.0 $ENV_NAME pyenv activate $ENV_NAME pip install --upgrade truss 'pydantic>=2.0.0' ``` To deploy Chains remotely, you also need a [Baseten account](https://app.baseten.co/signup). It is handy to export your API key to the current shell session or permanently in your `.bashrc`: ```bash ~/.bashrc theme={"system"} export BASETEN_API_KEY="nPh8..." ``` ## Example: Hello World Chains are written in Python files. In your working directory, create `hello_chain/hello.py`: ```sh theme={"system"} mkdir hello_chain cd hello_chain touch hello.py ``` In the file, we'll specify a basic Chain. It has two Chainlets: * `HelloWorld`, the entrypoint, which handles the input and output. * `RandInt`, which generates a random integer. It is used as a dependency by `HelloWorld`. Via the entrypoint, the Chain takes a maximum value and returns the string "Hello World! " repeated a variable number of times.
```python hello.py theme={"system"} import random import truss_chains as chains class RandInt(chains.ChainletBase): async def run_remote(self, max_value: int) -> int: return random.randint(1, max_value) @chains.mark_entrypoint class HelloWorld(chains.ChainletBase): def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None: self._rand_int = rand_int async def run_remote(self, max_value: int) -> str: num_repetitions = await self._rand_int.run_remote(max_value) return "Hello World! " * num_repetitions ``` ### The Chainlet class-contract Exactly one Chainlet must be marked as the entrypoint with the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator. This Chainlet is responsible for handling public-facing input and output for the whole Chain in response to an API call. A Chainlet class has a single public method, [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which is the API endpoint for the entrypoint Chainlet and the function that other Chainlets can use as a dependency. The [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method must be fully type-annotated with primitive python types or [pydantic models](https://docs.pydantic.dev/latest/). Chainlets cannot be naively instantiated. The only correct usages are: 1. Make one Chainlet depend on another one via the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) directive as an `__init__`-argument as shown above for the `RandInt` Chainlet. 2. In the [local debugging mode](/development/chain/localdev#test-a-chain-locally). Beyond that, you can structure your code as you like, with private methods, imports from other files, and so forth. Keep in mind that Chainlets are intended for distributed, replicated, remote execution, so using global variables, global state, and certain Python features like importing modules dynamically at runtime should be avoided as they may not work as intended. ### Deploy your Chain to Baseten To deploy your Chain to Baseten, run: ```bash theme={"system"} truss chains push hello.py ``` The deploy command results in an output like this: ``` ⛓️ HelloWorld - Chainlets ⛓️ ╭──────────────────────┬─────────────────────────┬─────────────╮ │ Status │ Name │ Logs URL │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ HelloWorld (entrypoint) │ https://... │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ RandInt (dep) │ https://... │ ╰──────────────────────┴─────────────────────────┴─────────────╯ Deployment succeeded. You can run the chain with: curl -X POST 'https://chain-.../run_remote' \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '' ``` Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in below command): ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"max_value": 10}' # "Hello World! Hello World! Hello World! " ``` ## Example: Poetry with LLMs Our second example also has two Chainlets, but is somewhat more complex and realistic. The Chainlets are: * `PoemGenerator`, the entrypoint, which handles the input and output and orchestrates calls to the LLM. * `PhiLLM`, which runs inference on Phi-3 Mini. This Chain takes a list of words and returns a poem about each word, written by Phi-3. 
The architecture is simple: the `PoemGenerator` entrypoint fans out one request per word to the `PhiLLM` Chainlet. We build this Chain in a new working directory (if you are still inside `hello_chain/`, go up one level with `cd ..` first): ```sh theme={"system"} mkdir poetry_chain cd poetry_chain touch poems.py ``` A similar end-to-end code example, using Mistral as an LLM, is available in the [examples repo](https://github.com/basetenlabs/model/tree/main/truss-chains/examples/mistral). ### Building the LLM Chainlet The main difference between this Chain and the previous one is that we now have an LLM that needs a GPU and more complex dependencies. Copy the following code into `poems.py`: ```python poems.py theme={"system"} import asyncio from typing import List import pydantic import truss_chains as chains from truss import truss_config PHI_HF_MODEL = "microsoft/Phi-3-mini-4k-instruct" # This caches the model weights from the Hugging Face repo # in the docker image that is used for deploying the Chainlet. PHI_CACHE = truss_config.ModelRepo( repo_id=PHI_HF_MODEL, allow_patterns=["*.json", "*.safetensors", ".model"] ) class Messages(pydantic.BaseModel): messages: List[dict[str, str]] class PhiLLM(chains.ChainletBase): # `remote_config` defines the resources required for this chainlet. remote_config = chains.RemoteConfig( docker_image=chains.DockerImage( # The phi model needs some extra python packages. pip_requirements=[ "accelerate==0.30.1", "einops==0.8.0", "transformers==4.41.2", "torch==2.3.0", ] ), # The phi model needs a GPU and more CPUs. compute=chains.Compute(cpu_count=2, gpu="T4"), # Cache the model weights in the image assets=chains.Assets(cached=[PHI_CACHE]), ) def __init__(self) -> None: # Note the imports of the *specific* python requirements are # pushed down to here. This code will only be executed on the # remotely deployed Chainlet, not in the local environment, # so we don't need to install these packages in the local # dev environment. import torch import transformers self._model = transformers.AutoModelForCausalLM.from_pretrained( PHI_HF_MODEL, torch_dtype=torch.float16, device_map="auto", ) self._tokenizer = transformers.AutoTokenizer.from_pretrained( PHI_HF_MODEL, ) self._generate_args = { "max_new_tokens" : 512, "temperature" : 1.0, "top_p" : 0.95, "top_k" : 50, "repetition_penalty" : 1.0, "no_repeat_ngram_size": 0, "use_cache" : True, "do_sample" : True, "eos_token_id" : self._tokenizer.eos_token_id, "pad_token_id" : self._tokenizer.pad_token_id, } async def run_remote(self, messages: Messages) -> str: import torch # Pass the raw list of role/content dicts to the chat template. model_inputs = self._tokenizer.apply_chat_template( messages.messages, tokenize=False, add_generation_prompt=True ) inputs = self._tokenizer(model_inputs, return_tensors="pt") input_ids = inputs["input_ids"].to("cuda") with torch.no_grad(): outputs = self._model.generate( input_ids=input_ids, **self._generate_args) output_text = self._tokenizer.decode( outputs[0], skip_special_tokens=True) return output_text ``` ### Building the entrypoint Now that we have an LLM, we can use it in a poem generator Chainlet. Add the following code to `poems.py`: ```python poems.py theme={"system"} import asyncio @chains.mark_entrypoint class PoemGenerator(chains.ChainletBase): def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None: self._phi_llm = phi_llm async def run_remote(self, words: list[str]) -> list[str]: tasks = [] for word in words: messages = Messages( messages=[ { "role" : "system", "content": ( "You are a poet who writes short, " "lighthearted, amusing poetry."
), }, {"role": "user", "content": f"Write a poem about {word}"}, ] ) tasks.append( asyncio.ensure_future(self._phi_llm.run_remote(messages))) await asyncio.sleep(0) # Yield to event loop, to allow starting tasks. return list(await asyncio.gather(*tasks)) ``` Note that we wrap each RPC to the LLM Chainlet in `asyncio.ensure_future`. This makes the current Python process start the remote calls concurrently, i.e. each call is issued before the previous one has finished, which minimizes the overall runtime. `asyncio.gather` then awaits the results of all calls and returns them as normal Python objects. If the LLM is hit with many concurrent requests, it can auto-scale up (if autoscaling is configured). More advanced LLMs have batching capabilities, so for those, even a single instance can serve concurrent requests. ### Deploy your Chain to Baseten To deploy your Chain to Baseten, run: ```bash theme={"system"} truss chains push poems.py ``` Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below): ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"words": ["bird", "plane", "superman"]}' #[[ #" [INST] Generate a poem about: bird [/INST] In the quiet hush of...", #" [INST] Generate a poem about: plane [/INST] In the vast, boundless...", #" [INST] Generate a poem about: superman [/INST] In the realm where..." #]] ``` --- # Source: https://docs.baseten.co/observability/export-metrics/grafana.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.baseten.co/llms.txt > Use this file to discover all available pages before exploring further. # Export to Grafana Cloud > Export metrics from Baseten to Grafana Cloud The Baseten + Grafana Cloud integration enables you to get real-time inference metrics within your existing Grafana setup. ## Video tutorial See below for step-by-step details from the video.